Training Tesseract 3.02.02 with sample images on Linux using jTessBoxEditor

chudongfang2015 · published 2016-10-18 00:24:28

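1. Prepare the sample images

To improve the recognition rate, first binarize each image. The command

convert name.png -monochrome name.png

turns the image into pure black and white (note that it overwrites name.png in place).

Then use

convert name.jpg name.tif

to convert images in other formats to .tif files.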
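
When there are many samples, the two convert commands above can be combined and applied to a whole directory. The following is only a minimal sketch: the helper names tif_name and convert_samples are hypothetical, and it assumes ImageMagick's convert is installed and that the samples are .png or .jpg files.

```shell
# Derive the .tif name for an input image, e.g. "name.jpg" -> "name.tif".
tif_name() {
    printf '%s.tif\n' "${1%.*}"
}

# Binarize every .png/.jpg in a directory and write it out as .tif
# (hypothetical helper; the source images are left untouched).
convert_samples() {
    dir=$1
    for src in "$dir"/*.png "$dir"/*.jpg; do
        [ -e "$src" ] || continue    # skip a pattern that matched nothing
        convert "$src" -monochrome "$(tif_name "$src")"
    done
}
```

For example, convert_samples ./samples would write one .tif next to each image in ./samples.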

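2. Merge the sample images

Open jTessBoxEditor and click Tools -> Merge TIFF. Hold down Shift to select the .tif files, and name the merged file fontname.fonttype.exp0.tif. Both fontname and fonttype are placeholders: fontname becomes the prefix of the trained data files, and fonttype is the font name used in font_properties.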

3. Generate the box file

Run the following command to generate fontname.fonttype.exp0.box:

tesseract fontname.fonttype.exp0.tif fontname.fonttype.exp0 -l eng -psm 3 batch.nochop makebox

The meaning of the -psm values is listed below; choose the one that matches the layout of your images.

Usage: tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]
pagesegmode values are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
-l lang and/or -psm pagesegmode must occur before any configfile.

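4. Correct the box file

Switch to the Box Editor tab of jTessBoxEditor, click Open, and open the merged file fontname.fonttype.exp0.tif; the tool automatically loads the matching .box file.

Check the box data against the image one box at a time, correct any wrong characters or coordinates, and save once everything has been fixed.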
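
After manual editing, a quick sanity check can catch malformed lines before training. The sketch below assumes the usual Tesseract 3.x box layout of six whitespace-separated fields per line (symbol, left, bottom, right, top, page) and flags any line that does not fit it; check_box_file is a hypothetical helper, not part of Tesseract or jTessBoxEditor.

```shell
# Print every line of a box file that does not look like
# "<symbol> <left> <bottom> <right> <top> <page>"; return 1 if any is found.
check_box_file() {
    awk '
        NF != 6 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ ||
        $4 !~ /^[0-9]+$/ || $5 !~ /^[0-9]+$/ || $6 !~ /^[0-9]+$/ ||
        $2 + 0 >= $4 + 0 || $3 + 0 >= $5 + 0 {
            printf "line %d: %s\n", NR, $0
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}
```

Running check_box_file fontname.fonttype.exp0.box prints nothing and returns 0 when every line passes; it does not check whether the characters themselves are correct, which still has to be done in the Box Editor.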

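5. Generate font_properties

Create font_properties with echo:

echo fonttype 0 0 0 0 0 >font_properties

Alternatively, create a plain text file named font_properties by hand (note that the file has no extension). Its content is the font name fonttype followed by five flags, which stand for italic, bold, fixed, serif and fraktur in that order; here they are all 0. The font name must match the fonttype part of the .tif file name.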
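
If several fonts are trained, writing the entries through a small function helps avoid typos in the font name or the flags. This is a sketch only: font_properties_line is a hypothetical helper, and it assumes the flag order italic, bold, fixed, serif, fraktur used by the Tesseract 3.x training tools.

```shell
# Print one font_properties entry:
#   <font> <italic> <bold> <fixed> <serif> <fraktur>
# Each flag must be 0 or 1; flags that are not given default to 0.
font_properties_line() {
    font=$1
    italic=${2:-0} bold=${3:-0} fixed=${4:-0} serif=${5:-0} fraktur=${6:-0}
    for flag in "$italic" "$bold" "$fixed" "$serif" "$fraktur"; do
        case $flag in
            0|1) ;;
            *) echo "flag must be 0 or 1, got: $flag" >&2; return 1 ;;
        esac
    done
    printf '%s %s %s %s %s %s\n' \
        "$font" "$italic" "$bold" "$fixed" "$serif" "$fraktur"
}
```

For the font in this article, font_properties_line fonttype >font_properties produces the same file as the echo command.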

6. Generate the training file

Run the following command to generate the training file fontname.fonttype.exp0.tr:

tesseract fontname.fonttype.exp0.tif fontname.fonttype.exp0 -l eng -psm 3 nobatch box.train

Note that the eng argument can be changed to suit your case.

7. Generate the character set file

Run the following command to generate the character set file, named unicharset.

unicharset_extractor fontname.fonttype.exp0.box

8. Generate the shape file

Run the following command to generate the shapetable file:

shapeclustering -F font_properties -U unicharset -O fontname.unicharset fontname.fonttype.exp0.tr

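9. Generate the clustered character feature files

Run the following command to generate three files: inttemp, pffmtable, and the output character set, which the -O option writes to fontname.unicharset.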

mftraining -F font_properties -U unicharset -O fontname.unicharset fontname.fonttype.exp0.tr

10. Generate the character normalization feature file

Run the following command to generate the normalization feature file normproto.

cntraining fontname.fonttype.exp0.tr

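11. Rename

Run the following commands to add the fontname. prefix to the files generated in steps 8, 9 and 10. The character set needs no rename: mftraining in step 9 already wrote fontname.unicharset through -O, and moving the step 7 unicharset over it would replace that output.

mv normproto fontname.normproto
mv inttemp fontname.inttemp
mv pffmtable fontname.pffmtable
mv shapetable fontname.shapetable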
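
Steps 6 to 11 can also be chained in one function so that the run stops at the first failing command. This is a minimal sketch, not a replacement for checking each step's output: run and train_steps are hypothetical helpers, DRY_RUN is an assumed variable that only prints the commands, and the file names follow the fontname.fonttype.exp0 pattern used above. It renames only the four files that still lack the prefix, since mftraining already writes fontname.unicharset through -O.

```shell
# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Steps 6-11 for one prefix/font pair; stops at the first failure.
train_steps() {
    lang=$1 font=$2
    base=$lang.$font.exp0
    run tesseract "$base.tif" "$base" -l eng -psm 3 nobatch box.train &&
    run unicharset_extractor "$base.box" &&
    run shapeclustering -F font_properties -U unicharset -O "$lang.unicharset" "$base.tr" &&
    run mftraining -F font_properties -U unicharset -O "$lang.unicharset" "$base.tr" &&
    run cntraining "$base.tr" &&
    for f in normproto inttemp pffmtable shapetable; do
        run mv "$f" "$lang.$f" || return 1
    done
}
```

Setting DRY_RUN=1 and calling train_steps fontname fonttype prints the commands without executing them, which is a convenient way to review them first.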