Common APIs in Tesseract 3

youngyang525 · published 2014-03-01 20:23:26

Initialization functions

(1) int Init(const char* datapath, const char* language,  char **configs, int configs_size, bool configs_global_only);
(2) int Init(const char* datapath, const char* language) { return Init(datapath, language, 0, 0, false);  }
(3) int InitLangMod(const char* datapath, const char* language);
(4) int InitWithoutLangModel(const char* datapath, const char* language);

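Main parameters: datapath is the path to the language data. language is an ISO 639-3 string, or NULL to use the default, English. For example, Simplified Chinese is "chi_sim", and English is either the default (NULL) or "eng". The remaining parameters can be left at their defaults.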

Image input functions

(1) char* TesseractRect(const unsigned char* imagedata, int bytes_per_pixel, int bytes_per_line,
                        int left, int top, int width, int height);

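TesseractRect: takes the image to process together with the region to recognize. imagedata is 8-bit grayscale, or 24-bit or 32-bit color pixel data; palette images must first be converted to 24-bit.
bytes_per_pixel is the number of bytes per pixel; bytes_per_line is the number of bytes per row, including any alignment padding. The remaining parameters (left, top, width, height) describe the rectangle to recognize.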

This call can also be split into the following functions:

(2) void SetImage(const unsigned char* imagedata, int width, int height, int bytes_per_pixel, int bytes_per_line);
(3)  void SetRectangle(int left, int top, int width, int height);

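SetImage: supplies the image to process; its parameters have the same meaning as in TesseractRect. Note that this function modifies the input image. It also does not copy the buffer or take ownership of it, so the pixel data must stay valid until recognition has run.

SetRectangle: sets the region to process.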
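As a rough illustration of the bytes_per_line parameter, the following sketch computes a padded row size and the offset of a pixel inside such a buffer. The helper names AlignedBytesPerLine and PixelOffset are hypothetical, and the 4-byte alignment is an assumption. When the image comes from a library, the stride that library reports (such as widthStep in IplImage) should be passed rather than a computed value.

```cpp
#include <cassert>
#include <cstddef>

// Bytes per row when every row is padded up to a multiple of `alignment`.
// The 4-byte default is an assumption, not something Tesseract requires.
inline int AlignedBytesPerLine(int width, int bytes_per_pixel, int alignment = 4) {
    assert(width > 0 && bytes_per_pixel > 0 && alignment > 0);
    const int raw = width * bytes_per_pixel;            // unpadded row size
    return (raw + alignment - 1) / alignment * alignment;
}

// Byte offset of pixel (x, y) inside a buffer laid out with the given stride.
inline std::size_t PixelOffset(int x, int y, int bytes_per_pixel, int bytes_per_line) {
    return static_cast<std::size_t>(y) * bytes_per_line +
           static_cast<std::size_t>(x) * bytes_per_pixel;
}
```

A 3-pixel-wide, 24-bit image has 9 bytes of pixel data per row but, under this assumption, 12 bytes per line. Passing 9 as bytes_per_line for such a buffer would skew every row after the first.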

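Getting the recognition result

(4) char* GetUTF8Text();

Returns the text found in the image as a UTF-8 string. The API documentation says the returned char* must be released with delete [], but in my tests calling delete[] on it caused an error.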
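The ownership rule for the returned buffer can be sketched with std::unique_ptr<char[]>, which calls delete[] automatically. FakeGetUTF8Text and TakeText are hypothetical names; the first only stands in for api.GetUTF8Text() so the snippet compiles without Tesseract. One possible cause of the delete[] error mentioned above (an assumption, the post does not confirm it) is that the library and the application were built against different C++ runtimes, in which case this pattern would not help either.

```cpp
#include <cassert>
#include <cstring>
#include <memory>
#include <string>

// Hypothetical stand-in for api.GetUTF8Text(): returns a new[]-allocated buffer.
inline char* FakeGetUTF8Text() {
    const char kText[] = "hello";
    char* out = new char[sizeof(kText)];
    std::memcpy(out, kText, sizeof(kText));
    return out;
}

// Copies the text into a std::string, then frees the raw buffer with delete[].
inline std::string TakeText(char* raw) {
    std::unique_ptr<char[]> owner(raw);  // delete[] runs when owner leaves scope
    return owner ? std::string(owner.get()) : std::string();
}
```

With the real API the call would look like TakeText(api.GetUTF8Text()), after which no manual cleanup of the text is needed.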

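Confidence evaluation

(5) int MeanTextConf();   // mean confidence of the recognized text, in the range 0-100
(6) int* AllWordConfidences(); // confidence of each word (0-100) in an array terminated by -1; the words follow the order of the text from GetUTF8Text

I think this group of functions is quite important: it gives a rough assessment of the recognition result, so poorly rated results can be handled separately. In my tests, good recognitions scored above 80 and poor ones below 80, so the score is a reasonable rough indicator of result quality.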
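The idea of handling poorly rated results separately can be sketched as follows. The function walks an array in the layout AllWordConfidences returns (values 0-100, terminated by -1) and collects the indices of words below a threshold. LowConfidenceWords is a hypothetical helper, and the default of 80 is only the rough cut-off observed in the tests described above, not a value defined by Tesseract.

```cpp
#include <cassert>
#include <vector>

// Returns the indices of words whose confidence is below `threshold`.
// `confs` is assumed to be terminated by -1, as AllWordConfidences documents.
inline std::vector<int> LowConfidenceWords(const int* confs, int threshold = 80) {
    assert(threshold >= 0 && threshold <= 100);
    std::vector<int> low;
    if (confs == nullptr) return low;
    for (int i = 0; confs[i] != -1; ++i) {
        if (confs[i] < threshold) low.push_back(i);  // a confidence of 0 is still a valid entry
    }
    return low;
}
```

Note that the loop stops on -1 rather than 0, because 0 is a legal confidence value.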

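Cleanup functions:

(7) void Clear(); // frees tesseract's internal image and recognition results; the API can be used again afterwards
(8) void End();  // frees all memory held by tesseract and shuts the API down

Remember to release resources, especially when running in a loop: call Clear to free the memory left over from the previous operation.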
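One way to make the per-iteration Clear hard to forget is a small scope guard. The sketch below uses a hypothetical FakeApi type that only mimics the Clear() and End() names so it can run without Tesseract; ClearOnExit and RunPages are likewise illustrative names, not part of the library.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in that records calls; it only mimics TessBaseAPI's method names.
struct FakeApi {
    std::vector<std::string> log;
    void Clear() { log.push_back("Clear"); }
    void End() { log.push_back("End"); }
};

// Calls Clear() on scope exit so every loop iteration releases its image and results.
template <typename Api>
class ClearOnExit {
public:
    explicit ClearOnExit(Api& api) : api_(api) {}
    ~ClearOnExit() { api_.Clear(); }
    ClearOnExit(const ClearOnExit&) = delete;
    ClearOnExit& operator=(const ClearOnExit&) = delete;
private:
    Api& api_;
};

// Processes `pages` images with one API instance and returns the recorded calls.
inline std::vector<std::string> RunPages(int pages) {
    assert(pages >= 0);
    FakeApi api;
    for (int i = 0; i < pages; ++i) {
        ClearOnExit<FakeApi> guard(api);
        // With the real API, SetImage(...) and GetUTF8Text() would go here.
    }
    api.End();  // called once, after the last image
    return api.log;
}
```

The guard runs Clear even if the loop body returns early or throws, while End is still called once at the very end.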

Tesseract also provides functions that expose intermediate results. I have not studied or tested them; the API descriptions are as follows:

 
  /* After SetImage or TesseractRect, get a copy of the internal thresholded image. */
  Pix* GetThresholdedImage();

  /* Get the result of page layout analysis.
     Can be called before or after Recognize. */
  Boxa* GetRegions(Pixa** pixa);

  /**
   * Get the textlines as a leptonica-style
   * Boxa, Pixa pair, in reading order.
   * Can be called before or after Recognize.
   * If blockids is not NULL, the block-id of each line is also returned
   * as an array of one element per line. delete [] after use.
   */
  Boxa* GetTextlines(Pixa** pixa, int** blockids);

  /**
   * Get the words as a leptonica-style
   * Boxa, Pixa pair, in reading order.
   * Can be called before or after Recognize.
   */
  Boxa* GetWords(Pixa** pixa);

  // Gets the individual connected (text) components (created
  // after pages segmentation step, but before recognition)
  // as a leptonica-style Boxa, Pixa pair, in reading order.
  // Can be called before or after Recognize.
  // Note: the caller is responsible for calling boxaDestroy()
  // on the returned Boxa array and pixaDestroy() on cc array.
  Boxa* GetConnectedComponents(Pixa** cc);

  // Get the given level kind of components (block, textline, word etc.) as a
  // leptonica-style Boxa, Pixa pair, in reading order.
  // Can be called before or after Recognize.
  // If blockids is not NULL, the block-id of each component is also returned
  // as an array of one element per component. delete [] after use.
  Boxa* GetComponentImages(PageIteratorLevel level,
                           Pixa** pixa, int** blockids);

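The functions above are enough to recognize the characters in an image, but Tesseract also provides others, for example for reading images, evaluating the confidence of recognized characters, and obtaining intermediate images from the recognition process.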

Image-reading functions

(1) inT8 IMAGE::read_header(const char* name);
(2) inT32 check_legal_image_size(  //get rest of image
        inT32 x,                   //x size required
        inT32 y,                   //y size required
        inT8 bits_per_pixel        //bpp required
    );
(3) inT8 read(inT32 buflines);

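Other people's examples use these functions to read the image, but in version 3.0 I could not find read or read_header in the IMAGE class. It may be a problem with the files I was using. In any case I did not want to use this class and preferred OpenCV for image loading and preprocessing, so I will not go into it further; if anyone knows what the problem is, please let me know. Skipping the provided functions and using OpenCV is actually very convenient, because no conversion is needed, as the following code shows: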

	IplImage *iplimg =  NULL;
	iplimg = cvLoadImage("1.jpg");
	tesseract::TessBaseAPI  api;
	//api.SetVariable("tessedit_char_whitelist", "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ");
	//api.SetVariable("classify_bln_numeric_mode", "123456789");
	api.Init("C:\\BuildFolder\\tesseract-3.01\\tessdata", NULL);
	//api.SetPageSegMode(PSM_SINGLE_BLOCK);
	api.SetImage((unsigned char*)(iplimg->imageData), 
						iplimg->width, iplimg->height,iplimg->nChannels  , iplimg->widthStep);// set the image
	char* text = api.GetUTF8Text();// recognize the text in the image

Here is my complete simple test program:


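#include "stdafx.h"

#include "allheaders.h"
#include "baseapi.h"
#include "resultiterator.h"
#include "strngs.h"
#include "blobs.h"

#include "cv.h"
#include "highgui.h"
#include "cxcore.h"

#include <stdio.h>
#include <stdlib.h>
using namespace  tesseract;

int _tmain(int argc, _TCHAR* argv[])
{
	// Load the image with OpenCV and stop early if the file cannot be read.
	IplImage *iplimg = cvLoadImage("1.jpg");
	if (iplimg == NULL)
	{
		printf("Failed to load 1.jpg\n");
		return 1;
	}

	tesseract::TessBaseAPI  api;
	//api.SetVariable("tessedit_char_whitelist", "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ");
	//SetVariable("tessedit_char_blacklist", "xyz"); to ignore x, y and z.
	//api.SetVariable("classify_bln_numeric_mode", "123456789");

	// Init returns 0 on success; NULL selects the default language (English).
	if (api.Init("C:\\BuildFolder\\tesseract-3.01\\tessdata", NULL) != 0)
	{
		printf("Failed to initialize tesseract\n");
		cvReleaseImage(&iplimg);
		return 1;
	}
	//api.SetPageSegMode(PSM_SINGLE_BLOCK);
	api.SetImage((unsigned char*)(iplimg->imageData), 
						iplimg->width, iplimg->height,iplimg->nChannels  , iplimg->widthStep);// set the image
	char* text = api.GetUTF8Text();// recognize the text in the image
	int d = api.MeanTextConf();// mean confidence, 0-100

	printf("%s\n","Recognition result");
	printf("%s\n",text);
	printf("%d\n",d);

	// Write the result and the mean confidence before closing the file.
	FILE* fout = fopen("txt_file.TXT", "w");
	if (fout != NULL)
	{
		fprintf(fout,"%s\n","Recognition result");
		fprintf(fout,"%s\n",text);
		fprintf(fout,"%d\n",d);
		fclose(fout);
	}

	// AllWordConfidences returns an array terminated by -1 (0 is a valid confidence).
	int *gg = api.AllWordConfidences();
	for (int *p = gg; p != NULL && *p != -1; p++)
	{
		printf("%d\n",*p);
	}

	getchar();

	// The API documentation says both buffers are released with delete [];
	// see the note on GetUTF8Text above about the error seen in testing.
	delete [] gg;
	delete [] text;
	api.Clear();
	api.End();
	cvReleaseImage(&iplimg);
	return 0;
}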

Reposted from: http://www.cnblogs.com/zsb517/archive/2012/06/06/2537540.html
