目录

前言

最近在使用深度学习,跑了一个大的model,然后GPU炸了,上网搜索了一下如何解决这个问题,做下笔记,分享给大家。
keras在使用GPU的时候有个特点,就是默认全部占满显存。 这样如果有多个模型都需要使用GPU跑的话,那么限制是很大的,而且对于GPU也是一种浪费。因此在使用keras时需要有意识的设置运行时使用那块显卡,需要使用多少容量。

具体可以分为以下三种情况:

  1. 指定显卡
  2. 限制GPU用量
  3. 即指定显卡又限制GPU用量

查看GPU使用情况语句(linux)

# 1秒钟刷新一次
watch -n 1 nvidia-smi


一、指定显卡

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

这里指定了使用编号为2的GPU,大家可以根据需要和实际情况来指定需要使用的GPU。

二、限制GPU用量

1、设置使用GPU的百分比

import tensorflow as tf
import keras.backend.tensorflow_backend as KTF

进行配置,使用30%的GPU

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.3
session = tf.Session(config=config)

设置session

KTF.set_session(session )

需要注意的是,虽然代码或配置层面设置了对显存占用百分比阈值,但在实际运行中如果达到了这个阈值,程序有需要的话还是会突破这个阈值。换而言之如果跑在一个大数据集上还是会用到更多的显存。以上的显存限制仅仅为了在跑小数据集时避免对显存的浪费而已。

2、GPU按需使用

import tensorflow as tf
import keras.backend.tensorflow_backend as KTF

config = tf.ConfigProto()  
config.gpu_options.allow_growth=True   #不全部占满显存, 按需分配
session = tf.Session(config=config)

# 设置session
KTF.set_session(sess)

三、指定GPU并且限制GPU用量

这个比较简单,就是讲上面两种情况连上即可。。。

import os
import tensorflow as tf
import keras.backend.tensorflow_backend as KTF

指定第一块GPU可用

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

config = tf.ConfigProto()  
config.gpu_options.allow_growth=True   #不全部占满显存, 按需分配
sess = tf.Session(config=config)

KTF.set_session(sess)

keras在使用GPU的时候有个特点,就是默认全部占满显存。
若单核GPU也无所谓,若是服务器GPU较多,性能较好,全部占满就太浪费了。
于是乎有以下三种情况:
在使用keras时候会出现总是占满GPU显存的情况,可以通过重设backend的GPU占用情况来进行调节。

import tensorflow as tf
from keras.backend.tensorflow_backend import set_session
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.3
set_session(tf.Session(config=config))

上述两个连一起用就行:

import os
import tensorflow as tf
os.environ["CUDA_VISIBLE_DEVICES"] = "2"
from keras.backend.tensorflow_backend import set_session
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.3
set_session(tf.Session(config=config))
CUDA_VISIBLE_DEVICES=0 python -m nmt.nmt 

这个问题在GitHub上已经摞了很高的,貌似是Windows特有的,而且和显存容量有关。最后有一位宛如救世主的老兄给出了他的总结性发言与变相的解决方案

    Here is a bit more info on how I temporarily resolved it. I believe these issues are all related to GPU memory allocation and have nothing to do with the errors being reported. There were other errors before this indicating some sort of memory allocation problem but the program continued to progress, eventually giving the cudnn errors that everyone is getting. The reason I believe it works sometimes is that if you use the gpu for other things besides tensorflow such as your primary display, the available memory fluctuates. Sometimes you can allocate what you need and other times it can’t.
From the API
https://www.tensorflow.org/versions/r0.12/how_tos/using_gpu/
```
“By default, TensorFlow maps nearly all of the GPU memory of all GPUs (subject to CUDA_VISIBLE_DEVICES) visible to the process. This is done to more efficiently use the relatively precious GPU memory resources on the devices by reducing memory fragmentation.”

I think this default allocation is broken in some way that causes this erratic behavior and certain situations to work and others to fail.

I have resolved this issue by changing the default behavior of TF to allocate a minimum amount of memory and grow as needed as detailed in the webpage.
```
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config, …)
    I have also tried the alternate way and was able to get it to work and fail with experimentally choosing a percentage that worked. In my case it ended up being about .7.

   

 config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = 0.4
    session = tf.Session(config=config, …)
Still no word from anyone on the TF team confirming this but it is worth a shot to see if others can confirm similar behavior.

答疑

如果还有其他问题,可以关注公众号,答主会在24h之内回复你。
在这里插入图片描述

GitHub 加速计划 / te / tensorflow
184.55 K
74.12 K
下载
一个面向所有人的开源机器学习框架
最近提交(Master分支:2 个月前 )
a49e66f2 PiperOrigin-RevId: 663726708 2 个月前
91dac11a This test overrides disabled_backends, dropping the default value in the process. PiperOrigin-RevId: 663711155 2 个月前
Logo

旨在为数千万中国开发者提供一个无缝且高效的云端环境,以支持学习、使用和贡献开源项目。

更多推荐