tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.

Data+Science+Insight · published 2021-05-05 07:34:11

nohup python train_rcnn.py &

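I use the command above to train an object detection model. In the vast majority of cases it starts fine and keeps running fine for quite a while.

However,
at some point during training, sometimes right at the start and sometimes only after several hours, the following error occurs:

GPU log?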

tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: slice index 1 of dimension 0 out of bounds.
   [[{{node build_head_train_sample/strided_slice_2}}]]
(1) Invalid argument: slice index 1 of dimension 0 out of bounds.
   [[{{node build_head_train_sample/strided_slice_2}}]]
   [[tower_2/optimizer/clip_by_norm_277/truediv/_13935]]
0 successful operations.
3 derived errors ignored.

Or:

tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: slice index 1 of dimension 0 out of bounds.
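
To make the message itself concrete: "slice index 1 of dimension 0 out of bounds" means that a strided-slice op asked for the element at index 1 along the first axis of a tensor whose first dimension was smaller than 2 at run time. Which tensor inside `build_head_train_sample` that is cannot be told from the log alone. The plain-Python sketch below only imitates the bounds check; the helper `take_along_dim0` and the sample lists are made up for illustration and are not TensorFlow code.

```python
def take_along_dim0(values, index):
    """Return values[index], rejecting out-of-range indices like the error above."""
    if not 0 <= index < len(values):
        raise IndexError(f"slice index {index} of dimension 0 out of bounds.")
    return values[index]


two_rows = [[10, 20], [30, 40]]  # dimension 0 has size 2
one_row = [[10, 20]]             # dimension 0 has size 1

print(take_along_dim0(two_rows, 1))  # fine: a row exists at index 1

try:
    take_along_dim0(one_row, 1)      # index 1 does not exist here
except IndexError as err:
    print(err)
```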

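One person solved it like this:
After a lot of searching, the conclusion was that the error comes from a version mismatch among CUDA, cuDNN and TensorFlow,
so I upgraded TensorFlow to 2.0 (I could not downgrade TensorFlow at the time, because after a downgrade the GPU could no longer be used).
The error still appeared. In the end I found the real cause on someone else's blog: a GPU memory allocation problem.
Switching to dynamic memory allocation solves it.
Add the following code at the top of the training script to enable dynamic memory allocation, then run it again.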

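import keras
import tensorflow as tf

# TensorFlow 1.x API; on TensorFlow 2.x these classes live under tf.compat.v1
# (tf.compat.v1.ConfigProto, tf.compat.v1.Session).
config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # let TensorFlow allocate GPU memory on demand
config.gpu_options.per_process_gpu_memory_fraction = 0.5  # cap this process at 50% of GPU memory
keras.backend.tensorflow_backend.set_session(tf.Session(config=config))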
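
The quoted fix relies on the TensorFlow 1.x session API, yet its author mentions upgrading to TensorFlow 2.0. As a minimal sketch, assuming TensorFlow 2.x is installed, on-demand allocation might be requested as follows. The helper name `enable_memory_growth` is made up for illustration; the `TF_FORCE_GPU_ALLOW_GROWTH` environment variable and `tf.config.experimental.set_memory_growth` are TensorFlow's own switches. Whether this removes the error in this post has not been verified.

```python
import os

# Must be set before TensorFlow is imported and initializes the GPUs.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"


def enable_memory_growth():
    """Turn on on-demand GPU memory allocation for every visible GPU (TF 2.x).

    Hypothetical helper; returns the number of GPUs it configured.
    """
    import tensorflow as tf  # imported lazily so the variable above is already set

    gpus = tf.config.experimental.list_physical_devices("GPU")
    for gpu in gpus:
        # Has to run before the GPU is first used, otherwise TensorFlow raises RuntimeError.
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)
```

Calling `enable_memory_growth()` once at the very top of the training script, before any model is built, is the intended usage.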

A Mask RCNN GitHub issue also suggests this strategy:
config.IMAGES_PER_GPU = 1
config.GPU_COUNT = 1
config.BATCH_SIZE = 1
It did not help.

Others attribute the error to the tensorflow or numpy version:

I had the same issue. It seems that updating to tensorflow 1.15.0 solves it.
I also found that limiting the gpu memory growth could also help,
however it did slow down the training speed a lot.
Try downgrading your numpy version. In my case, i had to downgrade it to 1.17.4
.....
.............
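
Because the suggestions above hinge on specific versions (tensorflow 1.15.0, numpy 1.17.4), it helps to confirm what is actually installed before upgrading or downgrading anything. The sketch below uses only the standard library (Python 3.8+). The function name `installed_versions` and the default package list are assumptions chosen for illustration.

```python
from importlib import metadata


def installed_versions(packages=("tensorflow", "tensorflow-gpu", "numpy", "keras")):
    """Map each package name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```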

What worked in practice:
session_config.gpu_options.allow_growth = True
session_config.allow_soft_placement = True
session_config.log_device_placement = True
session_config.gpu_options.per_process_gpu_memory_fraction = 0.9

It works!
So the problem is essentially not in the training step itself;
it occurs in the validation step that runs during training.
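
The four `session_config` settings reported as effective are shown without the object they belong to. A minimal sketch of how they might be assembled follows, assuming a TensorFlow release that provides `tf.compat.v1` (recent 1.x releases and 2.x). The function name `build_session_config` and its `memory_fraction` parameter are hypothetical; the post does not show how the training script creates its session.

```python
def build_session_config(memory_fraction=0.9):
    """Build a session config with the four settings listed above (hypothetical helper)."""
    if not 0.0 < memory_fraction <= 1.0:
        raise ValueError("memory_fraction must be in (0, 1]")

    import tensorflow as tf  # imported lazily; assumes tf.compat.v1 is available

    session_config = tf.compat.v1.ConfigProto()
    # Allocate GPU memory on demand instead of reserving it all up front.
    session_config.gpu_options.allow_growth = True
    # Fall back to another device when an op cannot run on the requested one.
    session_config.allow_soft_placement = True
    # Log which device every op is placed on (verbose, useful for debugging).
    session_config.log_device_placement = True
    # Upper bound on the share of each GPU's memory this process may use.
    session_config.gpu_options.per_process_gpu_memory_fraction = memory_fraction
    return session_config
```

The result would then be passed when the session is created, for example `tf.compat.v1.Session(config=build_session_config())`.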

Reference: tensorflow

Reference: githu