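Automated recognition with OCR generally does not reach a very high accuracy, but it copes well enough with ordinary, simple CAPTCHAs. This walkthrough uses Tesseract-OCR. Download: http://pan.baidu.com/s/1kUGaw8R
So how do you use it?
First, add the Tesseract-OCR installation directory to the PATH environment variable, then check it from a command prompt:
If the command produces the output shown above, the installation is working.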
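The PATH step can also be checked from Java before any OCR call is made. The following is a minimal sketch, not part of Tesseract or Selenium: `PathCheck` and `findOnPath` are hypothetical names, and the file names `tesseract.exe` and `tesseract` are assumptions about how the executable is called on your system.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: looks for an executable in the directories of a PATH-style string.
class PathCheck {
    // Returns the first matching file found in any PATH directory, or null if none exists.
    static Path findOnPath(String pathValue, String... fileNames) {
        if (pathValue == null) {
            return null;
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String name : fileNames) {
                try {
                    Path candidate = Paths.get(dir, name);
                    if (Files.isRegularFile(candidate)) {
                        return candidate;
                    }
                } catch (InvalidPathException e) {
                    // Skip PATH entries that are not valid paths on this platform
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Path found = findOnPath(System.getenv("PATH"), "tesseract.exe", "tesseract");
        System.out.println(found != null ? "Found: " + found : "tesseract is not on PATH");
    }
}
```

If nothing is found, the later `tesseract` command would fail as well, so it makes sense to run this check first.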
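I placed a CAPTCHA image, cp.png, in the tesseract directory on drive E and ran Tesseract on it:
The result is:
Now for a hands-on example. First, prepare a web page: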
- <html>
- <head>
- <title>CAPTCHA</title>
- </head>
- <body>
- <form>
- <label for="cp">CAPTCHA:</label>
- <input id="cp" type="text"/>
- <img src="http://www.csti.cn/uc/index/verify.htm">
- </form>
- </body>
- </html>
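To recognize the CAPTCHA you first need its image: take a screenshot of the whole page, then crop it using the element's coordinates on the page: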
- // Imports needed: java.awt.image.BufferedImage, java.io.File, javax.imageio.ImageIO,
- // org.openqa.selenium.OutputType, Point, TakesScreenshot, WebDriver, WebElement
- public static void captureElement(WebDriver driver, WebElement element, String path) {
- // Take a screenshot of the whole page first
- File srcFile = ((TakesScreenshot) driver).getScreenshotAs(OutputType.FILE);
- try {
- // Size and top-left corner of the element on the page
- int width = element.getSize().getWidth();
- int height = element.getSize().getHeight();
- Point p = element.getLocation();
- // Crop the element's region out of the full screenshot
- BufferedImage img = ImageIO.read(srcFile);
- BufferedImage dest = img.getSubimage(p.getX(), p.getY(), width, height);
- // Write the cropped image straight to the target path
- ImageIO.write(dest, "png", new File(path));
- } catch (Exception e) {
- e.printStackTrace();
- }
- }
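One thing to watch in the cropping step: `BufferedImage.getSubimage` throws a `RasterFormatException` when the requested region extends past the image, which can happen if the element is partly outside the captured area. The following is a minimal sketch of clamping the region first. `ElementCrop`, `clamp`, and `crop` are hypothetical names, and the code assumes the element coordinates and the screenshot use the same pixel scale.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

// Hypothetical helper: crops a region out of a screenshot without leaving its bounds.
class ElementCrop {
    // Intersects the element's rectangle with the screenshot's bounds.
    static Rectangle clamp(int x, int y, int width, int height, int imgWidth, int imgHeight) {
        Rectangle element = new Rectangle(x, y, width, height);
        Rectangle image = new Rectangle(0, 0, imgWidth, imgHeight);
        return element.intersection(image);
    }

    // Crops the clamped region; rejects elements that lie entirely outside the screenshot.
    static BufferedImage crop(BufferedImage screenshot, int x, int y, int width, int height) {
        Rectangle r = clamp(x, y, width, height, screenshot.getWidth(), screenshot.getHeight());
        if (r.isEmpty()) {
            throw new IllegalArgumentException("Element lies outside the screenshot");
        }
        return screenshot.getSubimage(r.x, r.y, r.width, r.height);
    }

    public static void main(String[] args) {
        BufferedImage shot = new BufferedImage(100, 50, BufferedImage.TYPE_INT_RGB);
        BufferedImage part = crop(shot, 80, 40, 40, 20);
        System.out.println(part.getWidth() + "x" + part.getHeight());
    }
}
```

In the Selenium code, the values passed to `crop` would come from `element.getLocation()` and `element.getSize()`.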
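Once the element has been cropped, call Tesseract-OCR to generate the text file:
- Runtime rt = Runtime.getRuntime();
- // Tesseract appends ".txt" to the output base name, so this writes e:\tesseract\cp.txt
- Process ocr = rt.exec("cmd.exe /C tesseract e:\\tesseract\\cp.png e:\\tesseract\\cp");
- ocr.waitFor(); // block until the OCR run has finished
Next, read the txt file from Java: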
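The same call can be made with `ProcessBuilder`, which passes the arguments as a list instead of one shell string and makes it easy to wait with a timeout. This is a minimal sketch, not the original author's code: `TesseractRunner`, `buildCommand`, and `run` are hypothetical names, and it assumes `tesseract` is on the PATH and Java 8 or later.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical wrapper around the tesseract command line.
class TesseractRunner {
    // Builds "tesseract <image> <outputBase>"; Tesseract writes <outputBase>.txt.
    static List<String> buildCommand(String imagePath, String outputBase) {
        return new ArrayList<>(Arrays.asList("tesseract", imagePath, outputBase));
    }

    // Runs Tesseract and returns true only if it exits with code 0 within the timeout.
    static boolean run(String imagePath, String outputBase, long timeoutSeconds)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(imagePath, outputBase));
        // Let the child use this process's console so it cannot block on a full pipe
        pb.inheritIO();
        Process process = pb.start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly();
            return false;
        }
        return process.exitValue() == 0;
    }

    public static void main(String[] args) throws Exception {
        boolean ok = run("e:\\tesseract\\cp.png", "e:\\tesseract\\cp", 10);
        System.out.println(ok ? "OCR finished" : "OCR failed or timed out");
    }
}
```

Checking the return value before reading the text file avoids picking up a stale result from an earlier run.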
- // Imports needed: java.io.BufferedReader, File, FileInputStream, InputStreamReader, IOException,
- // java.nio.charset.StandardCharsets
- public static String readTextFile(String filePath) {
- File file = new File(filePath);
- if (!file.isFile()) {
- System.out.println("File not found: " + filePath);
- return null;
- }
- // Tesseract writes its text output as UTF-8; try-with-resources closes the reader
- try (BufferedReader reader = new BufferedReader(
- new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8))) {
- // Only the first line is needed for a single-line CAPTCHA
- String line = reader.readLine();
- return line == null ? null : line.trim();
- } catch (IOException e) {
- System.out.println("Failed to read the file contents");
- e.printStackTrace();
- return null;
- }
- }
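OCR output for a CAPTCHA often contains stray spaces or punctuation, so it can help to clean the text before typing it into the form. The following is a minimal sketch under the assumption that the CAPTCHA consists only of ASCII letters and digits. `CaptchaText` and `clean` are hypothetical names.

```java
// Hypothetical helper: normalizes raw OCR output for a letters-and-digits CAPTCHA.
class CaptchaText {
    // Removes everything except ASCII letters and digits; null becomes an empty string.
    static String clean(String raw) {
        if (raw == null) {
            return "";
        }
        return raw.replaceAll("[^A-Za-z0-9]", "");
    }

    public static void main(String[] args) {
        System.out.println(clean(" 4 f-9.k\n")); // prints "4f9k"
    }
}
```

If the target CAPTCHA uses other characters, the pattern would need to be adjusted accordingly.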
Finally, put it all together:
- public static void main(String[] args) throws IOException, InterruptedException {
- WebDriver driver = new FirefoxDriver();
- driver.manage().window().maximize();
- driver.get("file:///E:/tesseract/cp.html");
- WebElement cp = driver.findElement(By.xpath("//img"));
- captureElement(driver, cp, "e:\\tesseract\\cp.png");
- Runtime rt = Runtime.getRuntime();
- Process ocr = rt.exec("cmd.exe /C tesseract e:\\tesseract\\cp.png e:\\tesseract\\cp");
- ocr.waitFor(); // wait for Tesseract to finish writing cp.txt instead of sleeping
- String cp2 = readTextFile("e:\\tesseract\\cp.txt");
- // readTextFile returns null when nothing could be read, and sendKeys rejects null
- if (cp2 != null) {
- driver.findElement(By.id("cp")).sendKeys(cp2);
- }
- }