Extracting tables from a PDF into row/column JSON data, with a 100% recognition rate


weixin_44214515 · Published 2021-10-20 12:53:31

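Merely extracting the text content of a PDF is not difficult, and I will write a separate post about that later. Extracting the tables in a PDF and returning them as JSON organized by row and column is another matter. There is plenty of material online, and some projects on GitHub as well, but most of it does not hold up in practice. Using a PDF extraction program from a real project as the example, this post shows how to extract table data from a PDF and return it by row and column.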

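A PDF may contain the following kinds of tables:
1. Clean tables with no noise of any kind.
2. Text tables overlaid with a watermark.
3. Tables made up entirely of images.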

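The first and second kinds are fairly easy to handle. The third requires saving the images embedded in the PDF as image files and then running them through Tesseract for recognition. I will cover Tesseract-based extraction in a later post; this one discusses only the first two cases.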

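The approach is to preprocess the PDF first: if it contains a watermark, remove it, because a watermark can corrupt the extracted content. The PDF is then exported to Excel, and the Excel output is parsed and returned as JSON.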
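The last step of this pipeline, returning the parsed rows and columns as JSON, is not shown in this post. Here is a minimal sketch of one way to do it, assuming the table is already available as a list of rows where each row is a list of cell strings. `TableJson`, `quote` and `toJson` are hypothetical names of mine, not part of the project, and a real project would more likely use a JSON library such as Jackson or Gson.

```java
import java.util.List;

/** Hypothetical helper: serializes a table as a JSON array of row arrays. */
final class TableJson {

    /** Wraps a cell in double quotes and escapes the characters JSON requires. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters must be written as \\uXXXX
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    /** Produces [["r0c0","r0c1"],["r1c0","r1c1"]] for a two-row, two-column table. */
    static String toJson(List<List<String>> rows) {
        StringBuilder sb = new StringBuilder("[");
        for (int r = 0; r < rows.size(); r++) {
            if (r > 0) sb.append(',');
            sb.append('[');
            List<String> row = rows.get(r);
            for (int c = 0; c < row.size(); c++) {
                if (c > 0) sb.append(',');
                sb.append(quote(row.get(c)));
            }
            sb.append(']');
        }
        return sb.append(']').toString();
    }
}
```

An array of arrays keeps the row and column positions explicit, which matters when a table has no reliable header row to use as object keys.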

The code below checks whether the PDF contains a watermark and, if it does, removes it:

Table execute() throws Exception {
	if (Objects.isNull(pdfPath)) {
		throw new IOException("PDF path is null, file does not exist");
	}

	File pdfFile = new File(pdfPath);
	if (!pdfFile.exists()) {
		throw new IOException("PDF does not exist: " + pdfFile.getPath());
	}

	// Create a temporary working directory
	File tempFolder = Utils.createTempWorkFolder();
	String convertedPdfPath = pdfPath;

	PDDocument document = PDDocument.load(pdfFile);
	try {
		watermarkRemover.init(document);

		if (watermarkRemover.isWatermarkPDF()) {
			// Try to remove the watermark
			watermarkRemover.removeWatermark();

			// Create a temporary PDF file name
			convertedPdfPath = Utils.createTempFileName(tempFolder, null, "pdf");

			// Save the cleaned PDF to the temporary file
			document.save(convertedPdfPath);
		}

		// Export the PDF to Excel, one file per page
		File[] convertedExcels = Utils.pdf2Excel(document, tempFolder, convertedPdfPath);

		// Parse the converted Excel files
		return parseExcel(convertedExcels);
	} finally {
		try {
			document.close();
		} finally {
			// Delete the temporary directory even when extraction fails
			Utils.deleteFile(tempFolder);
		}
	}
}


So how do we convert the PDF to Excel? This post uses the tabula library. Add the following dependency to pom.xml:

<dependency>
    <groupId>technology.tabula</groupId>
    <artifactId>tabula</artifactId>
    <version>1.0.5</version>
</dependency>

The Java code that converts the PDF to Excel is as follows:

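/**
	 * Converts a PDF to Excel, one output file per page.
	 * 
	 * @param pdfDocument the loaded PDF document, used here for its page count
	 * @param workFolder the temporary working directory that receives the output files
	 * @param convertedPdfPath path of the PDF to convert (the watermark-free copy, if one was created)
	 * @return the converted files, where index i holds page i + 1
	 * @throws Exception if the PDF has no pages or the conversion fails
	 */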
	public static File[] pdf2Excel(PDDocument pdfDocument, File workFolder, String convertedPdfPath) throws Exception {
		if (pdfDocument.getNumberOfPages() == 0) {
			throw new Exception("PDF file is empty : page count is 0");
		}

		File[] converted = new File[pdfDocument.getNumberOfPages()];
		for (int i = 1; i <= pdfDocument.getNumberOfPages(); i++) {
			String tempExcel = createTempFileName(workFolder, String.valueOf(i), "csv");
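			// tabula options: -o output file, -g guess the table area, -l lattice mode,
			// -r spreadsheet mode (older alias of -l), -i silent, -p page to extract
			String[] params = new String[] { "-o", tempExcel, "-g", "-l", "-r", "-i", "-p", String.valueOf(i), convertedPdfPath };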
			CommandLineParser parser = new DefaultParser();
			CommandLine line = parser.parse(CommandLineApp.buildOptions(), params);
			new CommandLineApp(System.out, line).extractTables(line);
			converted[i - 1] = new File(tempExcel);
		}
		return converted;
	}

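This method converts the PDF page by page, producing one Excel file per page. Pay special attention here: never convert a multi-page PDF into a single Excel file. Always convert page by page, because tabula does not handle multi-page conversion well, and tables on later pages may be lost or come out completely different from what you want.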
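Converting page by page means the per-page results have to be stitched back together afterwards. The sketch below shows one possible way to do that, assuming each page has already been parsed into a list of rows. `PageMerger` is a hypothetical name, and treating the first row of the first page as a header that may repeat on later pages is my assumption, not something tabula guarantees.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: joins per-page tables into one table, in page order. */
final class PageMerger {

    /**
     * Concatenates the rows of every page. The first row encountered is treated
     * as the header; when a later page starts with an identical row, that
     * repeated header is skipped.
     */
    static List<List<String>> merge(List<List<List<String>>> pages) {
        List<List<String>> merged = new ArrayList<>();
        List<String> header = null;
        for (List<List<String>> page : pages) {
            for (int r = 0; r < page.size(); r++) {
                List<String> row = page.get(r);
                if (header == null) {
                    header = row;   // first row seen anywhere becomes the header
                } else if (r == 0 && row.equals(header)) {
                    continue;       // same header repeated at the top of a later page
                }
                merged.add(row);
            }
        }
        return merged;
    }
}
```

If the tables in your PDFs do not repeat their header on each page, the equality check simply never matches and the rows are concatenated unchanged.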

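Once the conversion is done, the Excel files can be parsed with POI. There are still plenty of pitfalls, though: the converted Excel may not have the layout you expect, and sometimes the difference is large, so reading it row by row with POI requires special handling. I won't go into the details here. If you are interested, message me privately for the code.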
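To give a rough idea of what that special handling can look like, here is a minimal sketch. It is not the project's code. The temporary files created in `pdf2Excel` carry a `csv` extension, so this sketch reads plain CSV text with the JDK alone instead of going through POI. `CsvRows`, `parseLine` and `normalize` are hypothetical names, and the two cleanup rules (drop rows that are entirely blank, pad short rows to the widest row) are assumptions about what typical output needs.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: turns CSV text lines into a rectangular table. */
final class CsvRows {

    /** Splits one CSV record into trimmed cells, honoring quoted fields and "" escapes. */
    static List<String> parseLine(String line) {
        List<String> cells = new ArrayList<>();
        StringBuilder cell = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        cell.append('"');   // "" inside quotes is a literal quote
                        i++;
                    } else {
                        inQuotes = false;   // closing quote
                    }
                } else {
                    cell.append(c);         // commas inside quotes stay in the cell
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == ',') {
                cells.add(cell.toString().trim());
                cell.setLength(0);
            } else {
                cell.append(c);
            }
        }
        cells.add(cell.toString().trim());
        return cells;
    }

    /** Drops rows whose cells are all empty, then pads short rows with "" to the widest row. */
    static List<List<String>> normalize(List<List<String>> rows) {
        List<List<String>> kept = new ArrayList<>();
        int width = 0;
        for (List<String> row : rows) {
            boolean blank = true;
            for (String s : row) {
                if (!s.isEmpty()) { blank = false; break; }
            }
            if (!blank) {
                kept.add(new ArrayList<>(row));
                width = Math.max(width, row.size());
            }
        }
        for (List<String> row : kept) {
            while (row.size() < width) row.add("");
        }
        return kept;
    }
}
```

This handles one record per line only. A quoted cell that spans several lines would have to be joined before calling `parseLine`, which is one reason a dedicated CSV library may be the better choice in a real project.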
