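The previous round of OCRus development has wrapped up. Its back end performs recognition with the open-source OCR engine Tesseract. For paper-style documents, where fonts are standard and sizes are consistent, recognition works well: according to the UNLV test results, Tesseract's accuracy is above 90%. For the phone photos that OCRus targets, however, accuracy is low, and for some images the output is basically unusable. OCRus does some image preprocessing, hoping to make each image cleaner and easier to recognize before it is handed to Tesseract, but for images that recognize poorly the preprocessing step does not improve accuracy much. I have recently been reading the Tesseract source code, hoping to start from the source, use Ceiling Analysis to identify the main bottleneck in Tesseract's recognition, and later improve Tesseract in a targeted way.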
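Contents

Setting up the source-analysis environment
Data structures
Source analysis
Page layout analysis steps
Binarization
Preprocessing
Remove vertical lines
Remove images
Filter connected component
Finding candidate tab-stop components
Finding the column layout
Finding the regions
Next steps

Setting up the source-analysis environment

Tesseract 3.02 ships with a Visual Studio 2008 project. The setup steps are described in: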
Setting up Tesseract-OCR
Building Tesseract-OCR

Tesseract also provides a Java UI for displaying intermediate results. Its setup is described in:
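ViewerDebugging

Note: the piccolo version stated in the Introduction is wrong. The ScrollView source uses 1.2 and does not run with newer piccolo versions. The Tesseract project on GitHub has already fixed this bug.

The program entry point is in api/tesseractmain.cpp: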
```
int main(int argc, char **argv) {
  …
}
```

Tesseract defines many variables that control whether intermediate results are output. Locate the following code:

```
if (!api.ProcessPages(image, NULL, 0, &text_out)) {
  fprintf(stderr, _("Error during processing.\n"));
}
```

and add the following lines before it:

```
api.SetVariable("tessedit_dump_pageseg_images", "true");            // dump the page image with lines and images removed
api.SetVariable("textord_show_blobs", "true");                      // show the blobs
api.SetVariable("textord_show_boxes", "true");                      // show the blobs' bounding boxes
api.SetVariable("textord_tabfind_show_blocks", "true");             // show candidate tab-stops and tab vectors
api.SetVariable("textord_tabfind_show_reject_blobs", "true");       // show rejected blobs
api.SetVariable("textord_tabfind_show_initial_partitions", "true"); // show initial partitions
api.SetVariable("textord_tabfind_show_partitions", "1");            // show final partitions
api.SetVariable("textord_tabfind_show_initialtabs", "true");        // show initial tab-stops
api.SetVariable("textord_tabfind_show_finaltabs", "true");          // show final tab vectors
api.SetVariable("textord_tabfind_show_images", "true");             // show image blobs
```
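This makes Tesseract output all of these intermediate results.

After the solution is imported into VS2008 there are 11 projects. Set tesseract as the startup project and set its command-line arguments to {imagepath} {textbase} segdemo inter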
Data structures

Page analysis result: PAGE_RES (ccstruct/pageres.h). A page analysis result contains a list of block analysis results, BLOCK_RES_LIST.
Block analysis result: BLOCK_RES (ccstruct/pageres.h). A block analysis result contains a list of row analysis results, ROW_RES_LIST.
Row analysis result: ROW_RES (ccstruct/pageres.h). A row analysis result contains a list of word analysis results, WERD_RES_LIST.
WERD_RES (ccstruct/pageres.h) is a collection of publicly accessible members that gathers information about a word result.

Source analysis

Page layout analysis steps

Binarization

Algorithm: OTSU

Call stack:
main [api/tesseractmain.cpp]
-> TessBaseAPI::ProcessPages [api/baseapi.cpp]
-> TessBaseAPI::ProcessPage [api/baseapi.cpp]
-> TessBaseAPI::Recognize [api/baseapi.cpp]
-> TessBaseAPI::FindLines [api/baseapi.cpp]
-> TessBaseAPI::Threshold [api/baseapi.cpp]
-> ImageThresholder::ThresholdToPix [ccmain/thresholder.cpp]
-> ImageThresholder::OtsuThresholdRectToPix [ccmain/thresholder.cpp]

OTSU is a global binarization algorithm. If the image contains shadows and the shadows are uneven, a single global threshold gives poor results. OCRus uses a local binarization algorithm, Wolf-Jolion, which gives fairly good results even on images with shadows.

[Figure: comparison of binarization results. Left: original image; middle: OTSU result; right: Wolf-Jolion result.]
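To make the difference between the two thresholding approaches concrete, here is a minimal standalone sketch. It is not Tesseract's or OCRus's code: `otsuThreshold` and `wolfJolionThreshold` are hypothetical helper names, the histogram input is an assumption, and the Wolf-Jolion function only evaluates the per-window threshold formula T = (1 - k)·m + k·M + k·(s/R)·(m - M), with the commonly used k = 0.5 as an assumed default. Computing the window statistics themselves is left out.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Global Otsu threshold from a 256-bin grey-level histogram.
// Pixels with value <= the returned threshold form the dark class.
int otsuThreshold(const std::array<std::uint64_t, 256>& hist) {
    std::uint64_t total = 0;
    double sumAll = 0.0;
    for (int v = 0; v < 256; ++v) {
        total += hist[v];
        sumAll += static_cast<double>(v) * static_cast<double>(hist[v]);
    }
    std::uint64_t w0 = 0;   // number of pixels in the dark class so far
    double sum0 = 0.0;      // sum of grey values in the dark class so far
    double bestVar = -1.0;
    int bestT = 0;
    for (int t = 0; t < 256; ++t) {
        w0 += hist[t];
        sum0 += static_cast<double>(t) * static_cast<double>(hist[t]);
        if (w0 == 0) continue;                 // dark class still empty
        const std::uint64_t w1 = total - w0;
        if (w1 == 0) break;                    // bright class is empty from here on
        const double mean0 = sum0 / static_cast<double>(w0);
        const double mean1 = (sumAll - sum0) / static_cast<double>(w1);
        const double diff = mean0 - mean1;
        // Between-class variance, up to a constant factor.
        const double betweenVar =
            static_cast<double>(w0) * static_cast<double>(w1) * diff * diff;
        if (betweenVar > bestVar) {
            bestVar = betweenVar;
            bestT = t;
        }
    }
    return bestT;
}

// Wolf-Jolion local threshold for one window.
// localMean, localStd: statistics of the window around the pixel.
// globalMin: minimum grey level of the whole image (M).
// maxStd: maximum window standard deviation over the whole image (R).
double wolfJolionThreshold(double localMean, double localStd,
                           double globalMin, double maxStd, double k = 0.5) {
    assert(k >= 0.0 && k <= 1.0);
    if (maxStd <= 0.0) return localMean;  // completely flat image
    return (1.0 - k) * localMean + k * globalMin +
           k * (localStd / maxStd) * (localMean - globalMin);
}
```

Because the Otsu threshold is computed once for the whole page, a shadowed region can end up entirely on one side of it. The local formula moves the threshold with each window's mean and contrast, which is why it copes better with uneven lighting.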
Preprocessing

Remove vertical lines

This step removes vertical and horizontal lines in the image.
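As a rough illustration of what line removal means on a binarized page, the sketch below clears every vertical run of ink pixels that is at least `minLen` pixels tall. This is a deliberately naive run-length approach, not the method used by LineFinder::FindAndRemoveLines; the `BinaryImage` type, the function name, and the `minLen` parameter are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Row-major binary image: 1 = ink, 0 = background (assumed layout).
using BinaryImage = std::vector<std::vector<std::uint8_t>>;

// Clears every vertical run of ink that is at least minLen pixels long.
// Returns the number of pixels cleared.
std::size_t removeLongVerticalRuns(BinaryImage& img, std::size_t minLen) {
    assert(minLen > 0);
    if (img.empty()) return 0;
    const std::size_t height = img.size();
    const std::size_t width = img[0].size();
    std::size_t cleared = 0;
    for (std::size_t x = 0; x < width; ++x) {
        std::size_t y = 0;
        while (y < height) {
            if (!img[y][x]) {
                ++y;
                continue;
            }
            const std::size_t start = y;          // first ink pixel of the run
            while (y < height && img[y][x]) ++y;  // y ends one past the run
            if (y - start >= minLen) {
                for (std::size_t yy = start; yy < y; ++yy) img[yy][x] = 0;
                cleared += y - start;
            }
        }
    }
    return cleared;
}
```

Horizontal lines can be handled the same way by scanning rows instead of columns. A real implementation has to be more careful than this, for example so that tall character strokes and characters touching a line are not erased.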
Call stack:
main [api/tesseractmain.cpp]
-> TessBaseAPI::ProcessPages [api/baseapi.cpp]
-> TessBaseAPI::ProcessPage [api/baseapi.cpp]
-> TessBaseAPI::Recognize [api/baseapi.cpp]
-> TessBaseAPI::FindLines [api/baseapi.cpp]
-> Tesseract::SegmentPage [ccmain/pagesegmain.cpp]
-> Tesseract::AutoPageSeg [ccmain/pagesegmain.cpp]
-> Tesseract::SetupPageSegAndDetectOrientation [ccmain/pagesegmain.cpp]
-> LineFinder::FindAndRemoveLines [textord/linefind.cpp]

Remove images

This step removes images from the picture.
Call stack:
main [api/tesseractmain.cpp]
-> TessBaseAPI::ProcessPages [api/baseapi.cpp]
-> TessBaseAPI::ProcessPage [api/baseapi.cpp]
-> TessBaseAPI::Recognize [api/baseapi.cpp]
-> TessBaseAPI::FindLines [api/baseapi.cpp]
-> Tesseract::SegmentPage [ccmain/pagesegmain.cpp]
-> Tesseract::AutoPageSeg [ccmain/pagesegmain.cpp]
-> Tesseract::SetupPageSegAndDetectOrientation [ccmain/pagesegmain.cpp]
-> ImageFind::FindImages [textord/imagefind.cpp]

I never got this function to work. The input image may need to satisfy certain conditions.
Filter connected component

This step generates all the connected components and filters out the noise blobs.
Call stack:
main [api/tesseractmain.cpp]
-> TessBaseAPI::ProcessPages [api/baseapi.cpp]
-> TessBaseAPI::ProcessPage [api/baseapi.cpp]
-> TessBaseAPI::Recognize [api/baseapi.cpp]
-> TessBaseAPI::FindLines [api/baseapi.cpp]
-> Tesseract::SegmentPage [ccmain/pagesegmain.cpp]
-> Tesseract::AutoPageSeg [ccmain/pagesegmain.cpp]
-> Tesseract::SetupPageSegAndDetectOrientation [ccmain/pagesegmain.cpp]
-> (i) Textord::find_components [textord/tordmain.cpp]
   -> extract_edges [textord/edgblob.cpp] // extract outlines and assign outlines to blobs
   -> assign_blobs_to_blocks2 [textord/edgblob.cpp] // assign normal, noise, and rejected blobs to TO_BLOCK_LIST for further blob filtering
   -> Textord::filter_blobs [textord/tordmain.cpp]
      -> Textord::filter_noise_blobs [textord/tordmain.cpp] // move small blobs to a separate list
   (ii) ColumnFinder::SetupAndFilterNoise [textord/colfind.cpp]

This step produces the following intermediate result:
The inner and outer outlines of each connected component are recognized, and a bounding box is drawn over each connected component. Potential small noise blobs, such as punctuation marks and the dot of the character "i", are drawn with pink outlines. Large blobs are drawn in dark green.
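The idea of this step can be sketched in a few lines: label the connected components of the binary image, compute each component's bounding box, and sort the components into noise, normal, and large by size. The code below is a simplified standalone illustration, not Tesseract's filter_noise_blobs logic. The `Blob` struct, the 8-connectivity choice, and the `minSize` / `maxSize` thresholds are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <queue>
#include <utility>
#include <vector>

// Row-major binary image: 1 = ink, 0 = background (assumed layout).
using BinaryImage = std::vector<std::vector<std::uint8_t>>;

enum class BlobKind { Noise, Normal, Large };

struct Blob {
    int left, top, right, bottom;  // inclusive bounding box
    BlobKind kind;
};

// Finds 8-connected components and classifies each by the longer side of its
// bounding box: below minSize is Noise, above maxSize is Large, else Normal.
std::vector<Blob> findAndClassifyBlobs(const BinaryImage& img,
                                       int minSize, int maxSize) {
    assert(minSize <= maxSize);
    std::vector<Blob> blobs;
    if (img.empty()) return blobs;
    const int height = static_cast<int>(img.size());
    const int width = static_cast<int>(img[0].size());
    std::vector<std::vector<bool>> seen(height, std::vector<bool>(width, false));
    for (int y0 = 0; y0 < height; ++y0) {
        for (int x0 = 0; x0 < width; ++x0) {
            if (!img[y0][x0] || seen[y0][x0]) continue;
            Blob blob{x0, y0, x0, y0, BlobKind::Normal};
            std::queue<std::pair<int, int>> pending;  // (x, y) pixels to visit
            pending.push(std::make_pair(x0, y0));
            seen[y0][x0] = true;
            while (!pending.empty()) {
                const int x = pending.front().first;
                const int y = pending.front().second;
                pending.pop();
                blob.left = std::min(blob.left, x);
                blob.right = std::max(blob.right, x);
                blob.top = std::min(blob.top, y);
                blob.bottom = std::max(blob.bottom, y);
                for (int dy = -1; dy <= 1; ++dy) {
                    for (int dx = -1; dx <= 1; ++dx) {
                        const int nx = x + dx, ny = y + dy;
                        if (nx < 0 || ny < 0 || nx >= width || ny >= height) continue;
                        if (!img[ny][nx] || seen[ny][nx]) continue;
                        seen[ny][nx] = true;
                        pending.push(std::make_pair(nx, ny));
                    }
                }
            }
            const int size = std::max(blob.right - blob.left + 1,
                                      blob.bottom - blob.top + 1);
            if (size < minSize) blob.kind = BlobKind::Noise;
            else if (size > maxSize) blob.kind = BlobKind::Large;
            blobs.push_back(blob);
        }
    }
    return blobs;
}
```

In the terms of the intermediate result described above, the Noise blobs would correspond to the pink outlines and the Large blobs to the dark green ones. Fixed pixel thresholds are only a stand-in here; a more realistic filter would derive them from the typical blob height on the page.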
Finding candidate tab-stop components

Call stack:
main [api/tesseractmain.cpp]
-> TessBaseAPI::ProcessPages [api/baseapi.cpp]
-> TessBaseAPI::ProcessPage [api/baseapi.cpp]
-> TessBaseAPI::Recognize [api/baseapi.cpp]
-> TessBaseAPI::FindLines [api/baseapi.cpp]
-> Tesseract::SegmentPage [ccmain/pagesegmain.cpp]
-> Tesseract::AutoPageSeg [ccmain/pagesegmain.cpp]
-> ColumnFinder::FindBlocks [textord/colfind.cpp]
-> TabFind::FindInitialTabVectors [textord/tabfind.cpp]
-> TabFind::FindTabBoxes [textord/tabfind.cpp]

This step finds the initial candidate tab-stop CCs by a radial search starting at every filtered CC from preprocessing.

[Figure: candidate tab-stop components.]
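The following is only a guess at the flavour of this search, written to make the idea tangible; it is not taken from TabFind::FindTabBoxes. It assumes that a CC is a left tab-stop candidate when the gutter to its left is empty and some CC on a neighbouring line has a nearly aligned left edge. The `Box` struct, the function name, and the `gutter`, `radius`, and `tolerance` parameters are all hypothetical, and image coordinates with y growing downward are assumed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

struct Box {
    int left, top, right, bottom;  // inclusive, y grows downward (assumed)
};

static bool verticalOverlap(const Box& a, const Box& b) {
    return a.top <= b.bottom && b.top <= a.bottom;
}

// Marks boxes that look like left tab-stop candidates under two assumed rules:
//  1. no other box on the same line reaches into the gutter of width `gutter`
//     immediately to the left of the box;
//  2. some box on another line, at most `radius` pixels above or below, has a
//     left edge within `tolerance` pixels of this box's left edge.
std::vector<bool> findLeftTabCandidates(const std::vector<Box>& boxes,
                                        int gutter, int radius, int tolerance) {
    assert(gutter > 0 && radius > 0 && tolerance >= 0);
    std::vector<bool> isCandidate(boxes.size(), false);
    for (std::size_t i = 0; i < boxes.size(); ++i) {
        const Box& box = boxes[i];
        bool gutterEmpty = true;
        bool alignedNeighbour = false;
        for (std::size_t j = 0; j < boxes.size(); ++j) {
            if (i == j) continue;
            const Box& other = boxes[j];
            // Rule 1: does `other` intersect [left - gutter, left - 1] on this line?
            if (verticalOverlap(box, other) &&
                other.left <= box.left - 1 &&
                other.right >= box.left - gutter) {
                gutterEmpty = false;
            }
            // Rule 2: vertical gap is positive only if the boxes are on different lines.
            const int gap = std::max(other.top - box.bottom, box.top - other.bottom);
            if (gap > 0 && gap <= radius &&
                std::abs(other.left - box.left) <= tolerance) {
                alignedNeighbour = true;
            }
        }
        isCandidate[i] = gutterEmpty && alignedNeighbour;
    }
    return isCandidate;
}
```

For two stacked lines of three boxes each, only the first box of each line would be marked, since the others have a neighbour inside their left gutter. Right tab-stop candidates could be found by mirroring the two rules.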
Finding the column layout

Call stack:
main [api/tesseractmain.cpp]
-> TessBaseAPI::ProcessPages [api/baseapi.cpp]
-> TessBaseAPI::ProcessPage [api/baseapi.cpp]
-> TessBaseAPI::Recognize [api/baseapi.cpp]
-> TessBaseAPI::FindLines [api/baseapi.cpp]
-> Tesseract::SegmentPage [ccmain/pagesegmain.cpp]
-> Tesseract::AutoPageSeg [ccmain/pagesegmain.cpp]
-> ColumnFinder::FindBlocks [textord/colfind.cpp]
-> ColumnFinder::FindBlocks (begin at line 369) [textord/colfind.cpp]

This step finds the column layout of the page.

[Figure: column layout of the page.]
Finding the regions

Call stack:
main [api/tesseractmain.cpp]
-> TessBaseAPI::ProcessPages [api/baseapi.cpp]
-> TessBaseAPI::ProcessPage [api/baseapi.cpp]
-> TessBaseAPI::Recognize [api/baseapi.cpp]
-> TessBaseAPI::FindLines [api/baseapi.cpp]
-> Tesseract::SegmentPage [ccmain/pagesegmain.cpp]
-> Tesseract::AutoPageSeg [ccmain/pagesegmain.cpp]
-> ColumnFinder::FindBlocks [textord/colfind.cpp]

This step recognizes the different types of blocks.

[Figure: block types found on the page.]
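Next steps

The algorithms for finding tab-stops and for the processing steps after that are still not clear to me and need further study.
I have not started reading the character recognition part yet. It should involve several machine learning algorithms, and I need to keep studying it when time allows.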
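Author: kaelsass. Source: CSDN. Original: https://blog.csdn.net/kaelsass/article/details/46874627 Copyright notice: this is an original article by the blogger; please include a link to the original post when reposting.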