tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
nohup python train_rcnn.py &
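The command above trains an object detection model. In the vast majority of runs, the start and even a long stretch afterwards go fine.
But then...
At some point during training, sometimes right at the start and sometimes only after several hours, the following error is raised:
GPU log?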
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: slice index 1 of dimension 0 out of bounds.
[[{{node build_head_train_sample/strided_slice_2}}]]
(1) Invalid argument: slice index 1 of dimension 0 out of bounds.
[[{{node build_head_train_sample/strided_slice_2}}]]
[[tower_2/optimizer/clip_by_norm_277/truediv/_13935]]
0 successful operations.
3 derived errors ignored.
Or:
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: slice index 1 of dimension 0 out of bounds.
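Both root errors name the same graph node, so it can help to pull the node name out of the log before trying fixes. The following is a minimal sketch, not part of the original training script: the helper name `summarize_root_errors` and the regular expressions are assumptions made for illustration, and the sample text is copied from the traceback above. The message itself says that an indexing op asked for element 1 along dimension 0 of a tensor that has fewer than two entries there.

```python
import re

# Sample text copied from the traceback in this post.
LOG = """\
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: slice index 1 of dimension 0 out of bounds.
[[{{node build_head_train_sample/strided_slice_2}}]]
(1) Invalid argument: slice index 1 of dimension 0 out of bounds.
[[{{node build_head_train_sample/strided_slice_2}}]]
[[tower_2/optimizer/clip_by_norm_277/truediv/_13935]]
0 successful operations.
3 derived errors ignored.
"""

# "(0) Invalid argument: <message>" lines introduce one root error each.
ROOT_ERROR = re.compile(r"^\s*\((\d+)\)\s+(.*?):\s+(.*)$")
# "{{node <name>}}" marks the graph node that raised the error.
NODE = re.compile(r"\{\{node\s+([^}]+)\}\}")


def summarize_root_errors(log_text):
    """Return a list of (index, message, node) tuples, one per root error."""
    results = []
    current = None
    for line in log_text.splitlines():
        header = ROOT_ERROR.match(line)
        if header:
            current = [int(header.group(1)), header.group(3), None]
            results.append(current)
            continue
        node = NODE.search(line)
        # Attach only the first node name that follows each root error.
        if node and current is not None and current[2] is None:
            current[2] = node.group(1)
    return [tuple(entry) for entry in results]


if __name__ == "__main__":
    for index, message, node in summarize_root_errors(LOG):
        print(index, node, message)
```

Here both entries resolve to `build_head_train_sample/strided_slice_2`, which narrows the search to the code that builds the head training samples rather than to the optimizer node listed on the last bracketed line.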
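Here is how one person solved it:
After a lot of searching, the conclusion was that the error comes from mismatched CUDA, cuDNN, and TensorFlow versions,
so I upgraded TensorFlow to 2.0 (I could not downgrade TensorFlow at the time, because the GPU stopped working after a downgrade).
The error persisted, though. I finally found the real cause on someone else's blog: GPU memory allocation.
Switching to dynamic memory allocation fixes it.
Add the following code at the top of the training script to allocate memory dynamically, then rerun.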
import keras
import tensorflow as tf

# TensorFlow 1.x API; under TensorFlow 2.x these names live in tf.compat.v1.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand
config.gpu_options.per_process_gpu_memory_fraction = 0.5  # cap this process at 50% of GPU memory
keras.backend.tensorflow_backend.set_session(tf.Session(config=config))
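The snippet above uses the TensorFlow 1.x session API, while the quoted author mentions having upgraded to 2.0. As a hedged sketch of the closest equivalent on newer versions, the function below turns on on-demand allocation through `tf.config.experimental`. The function name `enable_gpu_memory_growth` is made up for this example, and the import guard is there only so the sketch also runs on a machine without TensorFlow.

```python
def enable_gpu_memory_growth():
    """Turn on on-demand GPU memory allocation; return the number of GPUs configured."""
    try:
        import tensorflow as tf
    except ImportError:
        return 0  # TensorFlow is not installed, so there is nothing to configure

    gpus = tf.config.experimental.list_physical_devices("GPU")
    for gpu in gpus:
        # This has to run before any op initializes the GPU;
        # afterwards TensorFlow raises RuntimeError.
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)


if __name__ == "__main__":
    print("GPUs with memory growth enabled:", enable_gpu_memory_growth())
```

Call it at the very top of the training script, before any model or dataset is built.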
The Mask RCNN GitHub issues also suggest this strategy:
config.IMAGES_PER_GPU = 1
config.GPU_COUNT = 1
config.BATCH_SIZE = 1
This did not work for me.
Others attribute the error to the tensorflow or numpy version:
I had the same issue. It seems that updating to tensorflow 1.15.0 solves it.
I also found that limiting the gpu memory growth could also help,
however it did slow down the training speed a lot.
Try downgrading your numpy version. In my case, i had to downgrade it to 1.17.4
...
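Before trying the version changes suggested in those comments, it is worth checking what is actually installed. The sketch below uses only the standard library (`importlib.metadata`, Python 3.8 or later). The helper names and the `SUGGESTED` table are assumptions for illustration; the two version numbers are the ones mentioned in the quoted comments, not a verified fix.

```python
from importlib import metadata

# Versions mentioned in the quoted comments above; not a verified fix.
SUGGESTED = {"tensorflow": "1.15.0", "numpy": "1.17.4"}


def installed_version(package):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_report(suggested=SUGGESTED):
    """Build one report line per package: installed version next to the suggested one."""
    lines = []
    for package, wanted in suggested.items():
        found = installed_version(package)
        status = "not installed" if found is None else found
        lines.append(f"{package}: installed {status}, suggested {wanted}")
    return lines


if __name__ == "__main__":
    # Note: GPU builds of TensorFlow 1.x were often installed under the
    # distribution name "tensorflow-gpu", which this lookup would not find.
    for line in version_report():
        print(line)
```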
What worked in practice:
session_config.gpu_options.allow_growth = True
session_config.allow_soft_placement = True
session_config.log_device_placement = True
session_config.gpu_options.per_process_gpu_memory_fraction = 0.9
This worked!
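The four lines above do not show where `session_config` comes from. A minimal sketch, assuming it is a TensorFlow 1.x `tf.ConfigProto` (or `tf.compat.v1.ConfigProto` on 2.x): the helper `apply_session_options` is a hypothetical name, and the demo uses a plain stand-in object so the block runs without TensorFlow installed.

```python
from types import SimpleNamespace


def apply_session_options(session_config, memory_fraction=0.9):
    """Set the four options listed above on a ConfigProto-like object."""
    # Allocate GPU memory on demand instead of reserving it all up front.
    session_config.gpu_options.allow_growth = True
    # Fall back to another device when an op cannot run on the requested one.
    session_config.allow_soft_placement = True
    # Log which device each op is placed on (verbose).
    session_config.log_device_placement = True
    # Upper bound on the share of each GPU's memory this process may use.
    session_config.gpu_options.per_process_gpu_memory_fraction = memory_fraction
    return session_config


# Stand-in object so the sketch runs anywhere. With TensorFlow 1.x installed,
# pass tf.ConfigProto() here instead (tf.compat.v1.ConfigProto() on 2.x).
demo_config = apply_session_options(SimpleNamespace(gpu_options=SimpleNamespace()))
print(demo_config)
```

The resulting object is then handed to whatever creates the session in the training script, for example `tf.Session(config=session_config)` under TensorFlow 1.x.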
So in essence, the problem does not occur in the training step itself,
but in the validation step that runs during training.
Reference: tensorflow
Reference: githu