【PaddleOCR】Training the end-to-end OCR model PGNet crashes with "maximum recursion depth exceeded while decoding a JSON array"
Problem
Environment:
- AI Studio A100
- Python 3.7.4
- PaddlePaddle 2.4.0
- PaddleOCR 2.6.0
Training a PGNet model in this environment fails with:
......
[2023/07/06 14:31:40] ppocr INFO: epoch: [2/600], global_step: 950, lr: 0.001000, loss: 1.698581, score_loss: 0.074852, border_loss: 0.025947, direction_loss: 0.014492, ctc_loss: 0.318290, avg_reader_cost: 0.00038 s, avg_batch_cost: 0.55773 s, avg_samples: 14.0, ips: 25.10191 samples/s, eta: 2 days, 7:34:16
[2023/07/06 14:31:46] ppocr INFO: epoch: [2/600], global_step: 960, lr: 0.001000, loss: 1.529947, score_loss: 0.074622, border_loss: 0.025173, direction_loss: 0.014684, ctc_loss: 0.283169, avg_reader_cost: 0.00038 s, avg_batch_cost: 0.55826 s, avg_samples: 14.0, ips: 25.07770 samples/s, eta: 2 days, 7:32:33
[2023/07/06 14:31:51] ppocr INFO: epoch: [2/600], global_step: 970, lr: 0.001000, loss: 1.508448, score_loss: 0.073050, border_loss: 0.024964, direction_loss: 0.014839, ctc_loss: 0.278475, avg_reader_cost: 0.00037 s, avg_batch_cost: 0.55947 s, avg_samples: 14.0, ips: 25.02383 samples/s, eta: 2 days, 7:30:56
[2023/07/06 14:31:57] ppocr INFO: epoch: [2/600], global_step: 980, lr: 0.001000, loss: 1.370401, score_loss: 0.070708, border_loss: 0.024799, direction_loss: 0.014229, ctc_loss: 0.254010, avg_reader_cost: 0.00036 s, avg_batch_cost: 0.55940 s, avg_samples: 14.0, ips: 25.02692 samples/s, eta: 2 days, 7:29:21
[2023/07/06 14:32:03] ppocr INFO: epoch: [2/600], global_step: 990, lr: 0.001000, loss: 1.189384, score_loss: 0.070311, border_loss: 0.025831, direction_loss: 0.014544, ctc_loss: 0.215893, avg_reader_cost: 0.00036 s, avg_batch_cost: 0.55872 s, avg_samples: 14.0, ips: 25.05728 samples/s, eta: 2 days, 7:27:45
[2023/07/06 14:32:08] ppocr INFO: epoch: [2/600], global_step: 1000, lr: 0.001000, loss: 1.111022, score_loss: 0.072967, border_loss: 0.026057, direction_loss: 0.014936, ctc_loss: 0.200845, avg_reader_cost: 0.00036 s, avg_batch_cost: 0.55829 s, avg_samples: 14.0, ips: 25.07636 samples/s, eta: 2 days, 7:26:10
eval model:: 0%| | 0/2000 [00:00<?, ?it/s][2023/07/06 14:32:18] ppocr ERROR: When parsing line 1979, error happened with msg: maximum recursion depth exceeded while decoding a JSON array from a unicode string
[2023/07/06 14:32:18] ppocr ERROR: When parsing line 292, error happened with msg: maximum recursion depth exceeded while decoding a JSON array from a unicode string
Fatal Python error: Cannot recover from stack overflow.
Current thread 0x00007fa0d4681700 (most recent call first):
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/json/decoder.py", line 353 in raw_decode
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/json/decoder.py", line 337 in decode
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/json/__init__.py", line 348 in loads
File "/home/aistudio/PaddleOCR-2.6.0/ppocr/data/imaug/label_ops.py", line 208 in __call__
File "/home/aistudio/PaddleOCR-2.6.0/ppocr/data/imaug/__init__.py", line 53 in transform
File "/home/aistudio/PaddleOCR-2.6.0/ppocr/data/pgnet_dataset.py", line 95 in __getitem__
File "/home/aistudio/PaddleOCR-2.6.0/ppocr/data/pgnet_dataset.py", line 102 in __getitem__
File "/home/aistudio/PaddleOCR-2.6.0/ppocr/data/pgnet_dataset.py", line 102 in __getitem__
File "/home/aistudio/PaddleOCR-2.6.0/ppocr/data/pgnet_dataset.py", line 102 in __getitem__
File "/home/aistudio/PaddleOCR-2.6.0/ppocr/data/pgnet_dataset.py", line 102 in __getitem__
File "/home/aistudio/PaddleOCR-2.6.0/ppocr/data/pgnet_dataset.py", line 102 in __getitem__
File "/home/aistudio/PaddleOCR-2.6.0/ppocr/data/pgnet_dataset.py", line 102 in __getitem__
......
The training config runs an evaluation every 1000 steps, and the error clearly occurs during eval:
maximum recursion depth exceeded while decoding a JSON array from a unicode string
The recursion limit was exceeded while decoding a JSON array. Combined with the long run of repeated __getitem__ frames in the stack trace above, recursion is clearly the core problem. But what is driving recursion this deep?
排查
sys包中有两个与递归有关的api:
sys.getrecursionlimit()
sys.setrecursionlimit(limit)
分别是获取和设置当前python解释器的递归深度
单纯的加大递归深度八成是治标不治本的方法,而且最大递归深度还受到当前平台的限制:
sys.setrecursionlimit(limit)
Set the maximum depth of the Python interpreter stack to limit. This limit prevents infinite recursion from causing an overflow of the C stack and crashing Python.
The highest possible limit is platform-dependent. A user may need to set the limit higher when they have a program that requires deep recursion and a platform that supports a higher limit. This should be done with care, because a too-high limit can lead to a crash.
So what is the default limit on the current platform? I checked on both Windows and AI Studio, and both return 1000:
>>> sys.getrecursionlimit()
1000
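The default limit and the failure mode are easy to demonstrate in isolation: any function that recurses past the limit raises RecursionError, which is the same class of failure that surfaces above wrapped in the JSON-decoder message.

```python
import sys

# Default interpreter recursion limit (1000 on most CPython builds).
print(sys.getrecursionlimit())

def recurse(depth=0):
    """Recurse until the interpreter refuses to grow the Python stack."""
    return recurse(depth + 1)

try:
    recurse()
except RecursionError as e:
    # "maximum recursion depth exceeded" -- the same error seen during eval.
    print(type(e).__name__)
```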
Analysis
When searching turns up nothing, reading the code is the most direct option. I walked forward from the main function in train.py... but I never found a forward path to pgnet_dataset's __getitem__; I only knew it happens somewhere around the model call:
# program.py, line 525
elif model_type in ['sr']:
    preds = model(batch)
    sr_img = preds["sr_img"]
    lr_img = preds["lr_img"]
else:
    preds = model(images)
batch_numpy = []
for item in batch:
    if isinstance(item, paddle.Tensor):
        batch_numpy.append(item.numpy())
    else:
        batch_numpy.append(item)
So I traced backwards instead. The key code sits at "ppocr/data/pgnet_dataset.py", line 102, in __getitem__:
def __getitem__(self, idx):
    file_idx = self.data_idx_order_list[idx]
    data_line = self.data_lines[file_idx]
    img_id = 0
    try:
        data_line = data_line.decode('utf-8')
        substr = data_line.strip("\n").split(self.delimiter)
        file_name = substr[0]
        label = substr[1]
        img_path = os.path.join(self.data_dir, file_name)
        if self.mode.lower() == 'eval':
            try:
                img_id = int(data_line.split(".")[0][7:])
            except:
                img_id = 0
        data = {'img_path': img_path, 'label': label, 'img_id': img_id}
        if not os.path.exists(img_path):
            raise Exception("{} does not exist!".format(img_path))
        with open(data['img_path'], 'rb') as f:
            img = f.read()
        data['image'] = img
        outs = transform(data, self.ops)
    except Exception as e:
        self.logger.error(
            "When parsing line {}, error happened with msg: {}".format(
                self.data_idx_order_list[idx], e))
        outs = None
    if outs is None:
        return self.__getitem__(np.random.randint(self.__len__()))
    return outs
The recursive call:
return self.__getitem__(np.random.randint(self.__len__()))
The function first reads a sample, then runs it through transform to get outs; if outs is None, it recurses. In terms of data flow: fetch one image from the dataset, and if that fails, substitute a randomly chosen one.
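This retry-by-recursion pattern is the root hazard: when every sample fails, each retry adds a stack frame until the interpreter dies. A loop-based retry avoids the stack growth entirely. The sketch below is not PaddleOCR's code; `load` is a hypothetical per-sample loader standing in for the body of __getitem__.

```python
import random

def get_item_iterative(dataset, idx, max_tries=1000):
    """Retry with a loop instead of recursion: failed samples cost no stack frames."""
    for _ in range(max_tries):
        outs = dataset.load(idx)  # hypothetical stand-in for the try-block above
        if outs is not None:
            return outs
        idx = random.randrange(len(dataset))  # resample, as __getitem__ does
    raise RuntimeError("all sampled items failed to load")
```

With a loop, a dataset where every label is broken fails loudly after max_tries attempts instead of crashing the interpreter.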
That many recursive frames means outs was None every single time; added logging confirmed that transform kept returning None. The transform method itself is simple:
def transform(data, ops=None):
    """ transform """
    if ops is None:
        ops = []
    for op in ops:
        data = op(data)
        if data is None:
            return None
    return data
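Its short-circuit behavior is easy to see with stand-in ops (the function is restated here so the snippet runs on its own; fail_op plays the role of a failing E2ELabelEncodeTest):

```python
def transform(data, ops=None):
    """Chain ops over data; any op returning None aborts the whole pipeline."""
    if ops is None:
        ops = []
    for op in ops:
        data = op(data)
        if data is None:
            return None
    return data

add_key = lambda d: {**d, "decoded": True}
fail_op = lambda d: None  # stands in for E2ELabelEncodeTest returning None

print(transform({"label": "x"}, [add_key]))           # {'label': 'x', 'decoded': True}
print(transform({"label": "x"}, [add_key, fail_op]))  # None
```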
transform loops over the ops and applies each op to data. So what is an op? The ops are built in __init__:
self.ops = create_operators(dataset_config['transforms'], global_config)
Digging further, create_operators reads the transforms section of the config file and constructs one operator object per entry. And what does that transforms config look like? Here it is:
Eval:
  dataset:
    name: PGDataSet
    data_dir: ./train_data/total_text/test
    label_file_list: [./train_data/total_text/test/test.txt]
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - E2ELabelEncodeTest:
      - E2EResizeForTest:
          max_side_len: 768
      - NormalizeImage:
          scale: 1./255.
          mean: [ 0.485, 0.456, 0.406 ]
          std: [ 0.229, 0.224, 0.225 ]
          order: 'hwc'
      - ToCHWImage:
      - KeepKeys:
          keep_keys: [ 'image', 'shape', 'polys', 'texts', 'ignore_tags', 'img_id' ]
  loader:
    shuffle: False
    drop_last: False
    batch_size_per_card: 1 # must be 1
    num_workers: 2
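Conceptually, create_operators just turns each one-key dict in that list into an operator object of the same name. This is a simplified sketch, not PaddleOCR's exact code: the real version in ppocr/data/imaug/__init__.py resolves class names dynamically, which the explicit OP_REGISTRY below stands in for, and DecodeImage here is a dummy placeholder class.

```python
class DecodeImage:
    """Stand-in operator class; the real operators live in ppocr/data/imaug."""
    def __init__(self, **kwargs):
        self.params = kwargs

# Explicit registry standing in for dynamic class-name lookup.
OP_REGISTRY = {"DecodeImage": DecodeImage}

def create_operators(op_param_list, global_config=None):
    """Each config entry is a single-key dict: {ClassName: {param: value, ...}}."""
    ops = []
    for operator in op_param_list:
        assert isinstance(operator, dict) and len(operator) == 1
        op_name = list(operator)[0]
        param = {} if operator[op_name] is None else dict(operator[op_name])
        if global_config is not None:
            param.update(global_config)  # global keys are passed to every op
        ops.append(OP_REGISTRY[op_name](**param))
    return ops

ops = create_operators([{"DecodeImage": {"img_mode": "BGR"}}], {"use_space_char": False})
print(ops[0].params)  # {'img_mode': 'BGR', 'use_space_char': False}
```

This also explains why global settings such as character_dict_path reach every label-encoding op without being listed under each transform.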
This is the Eval transforms config: DecodeImage, E2ELabelEncodeTest, E2EResizeForTest, and so on. As shown in the transform method above, the moment any op returns None, transform returns None, the outer outs becomes None, and the recursion kicks in. So which op was failing? More logging showed that every time None came back, the op was E2ELabelEncodeTest: every single sample was failing inside E2ELabelEncodeTest.
So what exactly does E2ELabelEncodeTest do? Its code:
class E2ELabelEncodeTest(BaseRecLabelEncode):
    def __init__(self,
                 max_text_length,
                 character_dict_path=None,
                 use_space_char=False,
                 **kwargs):
        super(E2ELabelEncodeTest, self).__init__(
            max_text_length, character_dict_path, use_space_char)

    def __call__(self, data):
        import json
        padnum = len(self.dict)
        label = data['label']
        label = json.loads(label)
        nBox = len(label)
        boxes, txts, txt_tags = [], [], []
        for bno in range(0, nBox):
            box = label[bno]['points']
            txt = label[bno]['transcription']
            boxes.append(box)
            txts.append(txt)
            if txt in ['*', '###']:
                txt_tags.append(True)
            else:
                txt_tags.append(False)
        boxes = np.array(boxes, dtype=np.float32)
        txt_tags = np.array(txt_tags, dtype=np.bool)
        data['polys'] = boxes
        data['ignore_tags'] = txt_tags
        temp_texts = []
        for text in txts:
            text = text.lower()
            text = self.encode(text)
            if text is None:
                return None
            text = text + [padnum] * (self.max_text_len - len(text)
                                      )  # use 36 to pad
            temp_texts.append(text)
        data['texts'] = np.array(temp_texts)
        return data
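The op expects data['label'] to be a JSON array of boxes, each with points and transcription fields. Here is a made-up two-box line in the shape this code parses (box coordinates and text are invented for illustration), along with what the loop above extracts from it:

```python
import json
import numpy as np

# Hypothetical label column: a JSON array, one object per text box.
label = ('[{"transcription": "hello", "points": [[10,10],[90,10],[90,40],[10,40]]},'
         ' {"transcription": "###", "points": [[10,50],[90,50],[90,80],[10,80]]}]')

boxes = json.loads(label)
polys = np.array([b["points"] for b in boxes], dtype=np.float32)
# "###" and "*" mark boxes whose text should be ignored during evaluation.
ignore = [b["transcription"] in ["*", "###"] for b in boxes]

print(polys.shape)  # (2, 4, 2)
print(ignore)       # [False, True]
```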
So E2ELabelEncodeTest mostly just tidies up the label data; the important part is encoding the text via self.encode:
def encode(self, text):
    """convert text-label into text-index.
    input:
        text: text labels of each image. [batch_size]
    output:
        text: concatenated text index for CTCLoss.
            [sum(text_lengths)] = [text_index_0 + text_index_1 + ... + text_index_(n - 1)]
        length: length of each text. [batch_size]
    """
    if len(text) == 0 or len(text) > self.max_text_len:
        return None
    if self.lower:
        text = text.lower()
    text_list = []
    for char in text:
        if char not in self.dict:
            # logger = get_logger()
            # logger.warning('{} is not in dict'.format(char))
            continue
        text_list.append(self.dict[char])
    if len(text_list) == 0:
        return None
    return text_list
It maps each character of text through self.dict. What is that dict? Time for the answer. In BaseRecLabelEncode's __init__:
if character_dict_path is None:
    logger = get_logger()
    logger.warning(
        "The character_dict_path is None, model can only recognize number and lower letters"
    )
    self.character_str = "0123456789abcdefghijklmnopqrstuvwxyz"
    dict_character = list(self.character_str)
    self.lower = True
else:
    self.character_str = []
    with open(character_dict_path, "rb") as fin:
        lines = fin.readlines()
        for line in lines:
            line = line.decode('utf-8').strip("\n").strip("\r\n")
            self.character_str.append(line)
    if use_space_char:
        self.character_str.append(" ")
    dict_character = list(self.character_str)
dict_character = self.add_special_char(dict_character)
self.dict = {}
for i, char in enumerate(dict_character):
    self.dict[char] = i
self.character = dict_character
So self.dict is built from the dictionary file pointed to by character_dict_path. And what is the default character_dict_path in the config?
character_dict_path: ppocr/utils/ic15_dict.txt
There is the answer: ic15_dict.txt is a digits-and-English dictionary, while my training data is Chinese. E2ELabelEncodeTest therefore cannot encode any of the text, returns None for every sample, and the recursive retry spins until the stack overflows.
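The failure is easy to reproduce in isolation. With the default alphanumeric dict, every character of a Chinese label falls through the `char not in self.dict` check, text_list stays empty, and encode returns None. Below is a standalone re-implementation of just that lookup logic (not the class method itself):

```python
def encode(text, char_dict, max_text_len=50):
    """Minimal re-implementation of the dict-lookup logic in BaseRecLabelEncode.encode."""
    if len(text) == 0 or len(text) > max_text_len:
        return None
    text_list = [char_dict[c] for c in text.lower() if c in char_dict]
    return text_list or None  # empty list -> None: the trigger for the recursion

# The default ic15-style dict: digits and lowercase letters only.
en_dict = {c: i for i, c in enumerate("0123456789abcdefghijklmnopqrstuvwxyz")}

print(encode("abc123", en_dict))   # [10, 11, 12, 1, 2, 3]
print(encode("中文文本", en_dict))  # None -> E2ELabelEncodeTest returns None
```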
Solution
After some searching in the official docs, I switched the dictionary to the Chinese one:
character_dict_path: ppocr/utils/ppocr_keys_v1.txt
At least training now runs normally..
The problem was only brute-forced away, though, not really understood: why does evaluation need to encode the text, why doesn't E2ELabelEncodeTrain encode characters during training, and what do the other transforms actually do? I hope my understanding catches up soon.