Java OpenCV + Tesseract: Extracting Tables from Images and Returning JSON by Row and Column
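In several of my other articles I covered Tesseract recognition of Chinese, digits, and letters, as well as some tricks for removing watermarks from PDFs. When an entire PDF consists of images (a scanned document, for example), how can we extract its tables and return the data as JSON organized by row and column?
One approach is to export the images in the PDF as image files and run recognition on those. GitHub hosts deep learning projects for this, such as CascadeTabNet and CDecNet, and Baidu and Tencent have similar ones. I tried CascadeTabNet (92 stars on GitHub at the time of writing) and Baidu's deep learning project for table recognition in images; CascadeTabNet takes 11 GB and Baidu's takes 19 GB. The results were acceptable on small images, but accuracy was very low on large ones such as a full A4 page, and neither could return the result as JSON.
Here I describe another approach that uses OpenCV and Tesseract to extract tables from images. It can handle more complex tables, such as nested tables.
Approach: OpenCV first analyzes the image and returns the key data (the detected cells). A design tool then lets you mark out and adjust the regions on the image and generates metadata for the page, and that metadata drives the extraction of the table data from the image. The steps are: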
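1. Table detection with OpenCV (this alone gets about 60% of table detection done).
2. A design tool, a small utility built with Vue, for reworking the cell data that OpenCV returns (this is what brings table detection to 100%).
3. Table recognition with OpenCV.
4. OCR with Tesseract.
5. Conversion to JSON for the response.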
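Step 1 depends on separating the ruled lines of a table from the text. The article's `getHorizontal` and `getVertical` helpers are not shown, so the following is only a minimal sketch of one common idea, written in plain Java so it runs without OpenCV: keep a foreground pixel only if it belongs to a sufficiently long horizontal run. Away from the image borders this is similar in effect to a morphological opening with a wide, one-pixel-high kernel. The class and method names are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of the article's code. */
final class LineRunFilter {

    /**
     * Returns a copy of the binary image in which only horizontal runs of at
     * least minRun consecutive foreground (true) pixels are kept. Text strokes
     * produce short runs and disappear; table rules produce long runs and stay.
     */
    static boolean[][] keepHorizontalRuns(boolean[][] binary, int minRun) {
        boolean[][] out = new boolean[binary.length][];
        for (int y = 0; y < binary.length; y++) {
            int cols = binary[y].length;
            out[y] = new boolean[cols];
            int start = 0;
            while (start < cols) {
                if (!binary[y][start]) {
                    start++;
                    continue;
                }
                // Find the end (exclusive) of the current foreground run
                int end = start;
                while (end < cols && binary[y][end]) {
                    end++;
                }
                if (end - start >= minRun) {
                    Arrays.fill(out[y], start, end, true);
                }
                start = end;
            }
        }
        return out;
    }
}
```

Applying the same filter to the transposed image would keep the vertical rules instead. A pixel set in both results marks a line crossing, which is the quantity the detection code obtains with `bitwise_and`.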
The source code follows (this article does not explain it in much detail, so please read it carefully). Part 1, OpenCV table detection:
/**
 * Parses the cell structure of the table in an image.
 *
 * @param path path of the image file
 * @return the detected table structure (image size plus rows of cells)
 */
public Table parseImageTableStructure(String path) {
    // Correct the skew of the image first
    pictureTiltCorrection(path);
    Mat src = Imgcodecs.imread(path);
    // 1. Convert the image to grayscale
    Mat gray = OpenCVUtils.gray(src);
    // 2. Binarize the image
    Mat adaptiveThreshold = OpenCVUtils.adaptiveThreshold(gray);
    // 3. Dilate, then erode: fills small gaps inside the table lines
    Mat element = Imgproc.getStructuringElement(Imgproc.MORPH_RECT, new Size(3, 3));
    Imgproc.dilate(adaptiveThreshold, adaptiveThreshold, element);
    Imgproc.erode(adaptiveThreshold, adaptiveThreshold, element);
    // 4. Extract the horizontal lines
    Mat horizontalLine = getHorizontal(adaptiveThreshold.clone());
    // 5. Extract the vertical lines
    Mat verticalLine = getVertical(adaptiveThreshold.clone());
    // 6. Merge the horizontal and vertical lines
    Mat tableLine = OpenCVUtils.getOr(horizontalLine, verticalLine);
    // 7. Use bitwise_and to locate the points where horizontal and vertical lines cross
    Mat points_image = new Mat();
    Core.bitwise_and(horizontalLine, verticalLine, points_image);
    // 8. Find the contours
    List<MatOfPoint> contours = new ArrayList<MatOfPoint>();
    Mat rootHierarchy = new Mat();
    Imgproc.findContours(tableLine, contours, rootHierarchy, Imgproc.RETR_LIST, Imgproc.CHAIN_APPROX_TC89_KCOS, new Point(0, 0));
    // 9. Analyze the contours
    LinkedList<MatWithProperty> tables = new LinkedList<MatWithProperty>();
    for (int i = 0; i < contours.size(); i++) {
        MatOfPoint contour = contours.get(i);
        double area = Imgproc.contourArea(contour);
        // Contours below this area are stray lines, not table cells
        if (area < 100) {
            continue;
        }
        // Approximate the contour with a polygon. approxPolyDP writes into its
        // second argument, so keep a reference to it instead of discarding it.
        MatOfPoint2f approx = new MatOfPoint2f();
        Imgproc.approxPolyDP(new MatOfPoint2f(contour.toArray()), approx, 3, true);
        // Bounding rectangle that encloses the approximated shape
        Rect boundRect = Imgproc.boundingRect(new MatOfPoint(approx.toArray()));
        // Region of the intersection image covered by this rectangle
        Mat table_image = points_image.submat(boundRect);
        List<MatOfPoint> table_contours = new ArrayList<MatOfPoint>();
        Mat joint_mat = new Mat();
        Imgproc.findContours(table_image, table_contours, joint_mat, Imgproc.RETR_CCOMP, Imgproc.CHAIN_APPROX_TC89_L1);
        // A complete cell has at least 4 intersection points; skip regions with fewer
        if (table_contours.size() < 4) {
            continue;
        }
        // Record the rectangle
        MatWithProperty mp = new MatWithProperty(null, boundRect);
        tables.addFirst(mp);
    }
    ImageTable table = new ImageTable();
    table.setImageHeight(src.rows());
    table.setImageWidth(src.cols());
    // 10. Build the buckets
    List<Row> horBuckets = new ArrayList<>();
    table.setRows(horBuckets);
    // Build the row buckets
    createRowBuckets(tables, horBuckets);
    // Walk the row buckets
    for (Row row : horBuckets) {
        RowBucket bucket = (RowBucket) row;
        List<MatWithProperty> rowMats = bucket.elements;
        // Build the column buckets for this row
        List<ColBucket> verBuckets = new ArrayList<>();
        createColBuckets(rowMats, verBuckets);
    }
    // Return the structure
    return table;
}
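The article does not show `createRowBuckets`, so its exact behavior is unknown. As a rough sketch of what a "row bucket" step could look like, the code below groups rectangles into rows when their top edges lie within a tolerance of each other, then orders each row from left to right. `CellRect`, `RowGrouping`, and the `tolerance` parameter are all hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for a detected cell rectangle. */
final class CellRect {
    final int x, y, width, height;

    CellRect(int x, int y, int width, int height) {
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }
}

/** Hypothetical sketch of row bucketing; not the article's createRowBuckets. */
final class RowGrouping {

    /**
     * Groups cells into rows. A cell starts a new row when its top edge lies
     * more than tolerance pixels below the top edge of the current row's
     * first cell. Each row is returned sorted by x.
     */
    static List<List<CellRect>> groupIntoRows(List<CellRect> cells, int tolerance) {
        List<CellRect> sorted = new ArrayList<>(cells);
        sorted.sort(Comparator.comparingInt((CellRect c) -> c.y));

        List<List<CellRect>> rows = new ArrayList<>();
        List<CellRect> current = null;
        int rowTop = 0;
        for (CellRect c : sorted) {
            if (current == null || c.y - rowTop > tolerance) {
                current = new ArrayList<>();
                rows.add(current);
                rowTop = c.y;
            }
            current.add(c);
        }
        // Cells in one row may differ slightly in y, so order them by x afterwards
        for (List<CellRect> row : rows) {
            row.sort(Comparator.comparingInt((CellRect c) -> c.x));
        }
        return rows;
    }
}
```

A fixed tolerance is the simplest choice and works when cells in a row share a top border. Nested tables, where one tall cell spans several inner rows, need something richer, which is one reason automatic detection alone may fall short on such tables.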
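The returned table is the table structure data; you can think of it as a mask of the table. For simple, clean tables this detection is 100% accurate, but for large or nested tables it reaches only about 60%. To get to 100%, the structure data needs a second, manual design pass, and that pass is done through a UI. Below is the design page, built with Vue: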
<template>
<div class="box ocr-design-wrapper">
<div style="height:50px;">
<h1 style="font-size:24px;text-align:center;height:50px;line-height:50px;">Extract Table Data from an Image</h1>
</div>
<div class="top-header" style="height:50px;" v-if="fileId && fileId.length > 0">
<el-button type="primary" @click="saveDesign" size="big" style="width: 150px">Save design</el-button>
<el-button type="warning" @click="showData" size="big" style="width: 150px">Show data</el-button>
</div>
<template v-if="!fileId || fileId.length == 0">
<div v-loading="imageParseLoading">
<el-upload class="upload-demo" drag :action="Constants.uploadServer" :before-upload="beforeUpload" :multiple=false :show-file-list="false" :on-success="handleImageSuccess">
<i class="el-icon-upload"></i>
<div class="el-upload__text">Drag an image file here, or <em>click to upload</em></div>
<div class="el-upload__tip" slot="tip" style="text-align:center;">Only jpg/png files up to 10 MB can be uploaded</div>
</el-upload>
</div>
</template>
<template v-else>
<el-row :gutter="10" class="design-port" ref="designPort" v-loading="imageParseLoading">
<el-col :span="24" class="design-port-container">
<img ref="backgroundImage" class="left-view" :src="`${this.fileId}`" @load="afterBackgroundImageLoaded" />
<!-- Draw the cells over the image -->
<div class="bounding-container">
<div ref="sketchContainer" class="sketch-container" :style="{width : backgroundImageSize.width, height: backgroundImageSize.height} " v-loading="sizeLoading" @dblclick="onDoubleClick($event)">
<template v-if="!sizeLoading">
<template v-for="(c,index) in cells">
<vue-draggable-resizable class="bound-box" :w="c.w" :h="c.h" :x="c.x" :y="c.y" @dragging="onDrag" @resizing="onResize" @activated="onActivated(c)" @deactivated="onDeactivated" @dragstop="(x,y)=>{onDragStop(x,y,c)}" @resizestop="(x,y,w,h) => {onResizeStop(x,y,w,h,c)}" :parent="true" :key="c.id">
<i class="el-icon-close bound-box-close" @click="onClickCloseBoundingBox(index, c)"></i>
</vue-draggable-resizable>
</template>
</template>
</div>
</div>
</el-col>
</el-row>
</template>
<el-drawer title="Attribute data" :visible.sync="drawer" :direction="direction">
<div style="padding:20px;border: 1px solid #d2d2d2;border-radius: 8px;margin: 10px;overflow-y: auto;max-height: 400px;">
{{attributeData}}
</div>
<div style="width:100%; margin:auto 0;text-align: center;">
<el-button type="primary" style="width: 90%;" @click="copyCells($event)">Copy to clipboard</el-button>
</div>
</el-drawer>
</div>
</template>
<script>
import '@/components/vue-draggable-resizable/dist/VueDraggableResizable.css';
import Clipboard from 'clipboard';
// Vue.set is used below, so Vue must be in scope (assumes Vue 2)
import Vue from 'vue';
const CELL_WIDTH = 100,
CELL_HEIGHT = 100;
export default {
components: {},
data() {
return {
drawer: false,
attributeData: '',
tableStructure: null,
direction: 'rtl',
imageParseLoading: false,
sizeLoading: true,
currentSelectedCell: null,
fileId: '',
backgroundImageSize: {
width: '0px',
height: '0px',
w: 0,
h: 0,
},
cells: []
};
},
computed: {},
mounted() {
var that = this;
// Remove the selected cell when the Delete key (keyCode 46) is pressed
document.onkeydown = function(e) {
if (e.keyCode == 46) {
that.removeCell();
}
};
},
methods: {
onDragStop(x, y, cell) {
cell.x = x;
cell.y = y;
},
onResizeStop(x, y, w, h, cell) {
cell.x = x;
cell.y = y;
cell.w = w;
cell.h = h;
},
scaleSourceCoordinate(h1, h2, h3) {
return h2 * h3 / h1;
},
generateAttributeData() {
// Convert displayed coordinates back to source-image coordinates by ratio
let attributeArray = [];
if (this.cells && this.cells.length > 0) {
this.cells.forEach(rect => {
attributeArray.push({
rect: {
height: this.scaleSourceCoordinate(this.backgroundImageSize.h, this.tableStructure.imageHeight, rect.h),
width: this.scaleSourceCoordinate(this.backgroundImageSize.w, this.tableStructure.imageWidth, rect.w),
x: this.scaleSourceCoordinate(this.backgroundImageSize.w, this.tableStructure.imageWidth, rect.x),
y: this.scaleSourceCoordinate(this.backgroundImageSize.h, this.tableStructure.imageHeight, rect.y)
}
});
});
}
let table = {
imageWidth: this.tableStructure.imageWidth,
imageHeight: this.tableStructure.imageHeight,
cells: attributeArray
}
Vue.set(this, 'attributeData', JSON.stringify(table));
},
copyCells(e) {
const clipboard = new Clipboard(e.target, {
text: () => this.attributeData
})
clipboard.on('success', () => {
this.$message.success('Copied to clipboard');
// Free the clipboard instance
clipboard.destroy()
})
clipboard.on('error', () => {
// Copying is not supported
this.$message.error('Copy failed');
// Free the clipboard instance
clipboard.destroy()
});
clipboard.onClick(e);
},
showData() {
this.generateAttributeData();
this.drawer = true;
},
removeCell() {
// Check whether a cell is currently selected
if (this.currentSelectedCell) {
let index = this.cells.indexOf(this.currentSelectedCell);
if (index != -1) {
this.cells.splice(index, 1);
}
}
},
onActivated(cell) {
this.currentSelectedCell = cell;
},
onDeactivated() {
this.currentSelectedCell = null;
},
getInt(str) {
if (str.endsWith('px')) {
str = str.substring(0, str.length - 2);
}
return parseInt(str);
},
beforeUpload() {
this.imageParseLoading = true;
},
handleImageSuccess(response, file, fileList) {
// Element UI passes the server response first; its url field is the actual path of the uploaded image
this.fileId = response.url;
this.parseImageStructure();
},
parseImageStructure() {
Service.post('/imgtableocr/ocr/parseImageStructure', { fileId: this.fileId }).then(result => {
this.imageParseLoading = false;
if (result.success) {
if (result.data && result.data.rows && result.data.rows.length > 0) {
this.tableStructure = result.data;
this.computeRows(result.data);
} else {
this.cells = [];
}
} else {
this.$message.error(result.message);
}
}).catch(error => {
// Stop the loading indicator on failure as well
this.imageParseLoading = false;
this.$message.error('Image recognition failed');
});
},
/**
* Convert the detected cell rectangles to display coordinates
*/
computeRows(table) {
let p = [];
if (table.rows && table.rows.length > 0) {
table.rows.forEach(row => {
if (row.elements && row.elements.length > 0) {
row.elements.forEach(cell => {
if (cell.rect) {
p.push({
// The template uses c.id as the v-for key, so detected cells need an id too
id: this.Utils.uuid(),
h: this.scaleSourceCoordinate(table.imageHeight, this.backgroundImageSize.h, cell.rect.height),
w: this.scaleSourceCoordinate(table.imageWidth, this.backgroundImageSize.w, cell.rect.width),
x: this.scaleSourceCoordinate(table.imageWidth, this.backgroundImageSize.w, cell.rect.x),
y: this.scaleSourceCoordinate(table.imageHeight, this.backgroundImageSize.h, cell.rect.y)
});
}
});
}
});
}
Vue.set(this, "cells", p)
},
afterBackgroundImageLoaded() {
let height = window.getComputedStyle(this.$refs.backgroundImage).height;
let width = window.getComputedStyle(this.$refs.backgroundImage).width;
Vue.set(this.backgroundImageSize, 'width', width);
Vue.set(this.backgroundImageSize, 'height', height);
Vue.set(this.backgroundImageSize, 'w', this.getInt(width));
Vue.set(this.backgroundImageSize, 'h', this.getInt(height));
this.sizeLoading = false;
},
/**
* Remove a bounding box
*/
onClickCloseBoundingBox(index, cell) {
this.cells.splice(index, 1);
},
/**
* Add a cell on double-click
*/
onDoubleClick(e) {
let x = e.offsetX;
let y = e.offsetY;
// Clamp the position to the image bounds
if (x <= 0) {
x = 0;
}
if (y <= 0) {
y = 0;
}
if (x >= (this.backgroundImageSize.w - CELL_WIDTH - 4)) {
x = this.backgroundImageSize.w - CELL_WIDTH - 4;
}
if (y >= (this.backgroundImageSize.h - CELL_HEIGHT - 4)) {
y = this.backgroundImageSize.h - CELL_HEIGHT - 4;
}
this.cells.push({
id: this.Utils.uuid(),
w: 100,
h: 100,
x: x,
y: y
});
},
onDrag() {
},
onResize() {
},
saveDesign() {
Service.post("/ocr/image/table/design/save", this.cells).then(result => {
if (result.success) {
}
});
}
}
};
</script>
<style lang="less" scoped>
.ocr-design-wrapper {
/deep/.upload-demo {
width: 100%;
.el-upload {
width: 100%;
.el-upload-dragger {
width: 100%;
}
}
}
.design-port-container {
width: 100%;
position: relative;
}
.bounding-container {
width: 100%;
height: 100%;
position: absolute;
left: 0;
top: 0;
.sketch-container {
/deep/.bound-box {
background-color: rgba(100, 255, 187, 0.4);
.bound-box-close {
display: none;
cursor: pointer;
position: absolute;
right: -7px;
top: -7px;
background-color: #00c9ff;
color: #ffffff;
}
&.active {
.bound-box-close {
display: block;
}
}
.handle-tl {
top: -5px;
left: -5px;
}
.handle-tm {
top: -5px;
}
.handle-tr {
right: -5px;
top: -5px;
}
.handle-mr {
right: -5px;
}
.handle-ml {
left: -5px;
}
.handle-bl {
left: -5px;
bottom: -5px;
}
.handle-bm {
bottom: -5px;
}
.handle-br {
bottom: -5px;
right: -5px;
}
}
}
}
.top-header {
height: 50px;
line-height: 50px;
padding-left: 20px;
background-color: #dedede;
width: 100%;
left: 0px;
z-index: 99999999;
border-radius: 6px;
}
.design-port {
margin-top: 10px;
.left-view {
width: 100%;
height: 100%;
}
}
}
</style>
<style lang="less" scoped>
.btn-wrap {
position: fixed;
top: 50%;
right: 10px;
z-index: 14;
width: 80px;
/deep/ .el-button {
margin-bottom: 10px;
opacity: 0.6;
&:hover {
opacity: 1;
}
}
/deep/ .el-button+.el-button {
margin-left: 0;
}
}
.innerDom {
display: none !important;
}
.box {
padding: 20px;
}
.comp-wrap {
width: 313px;
float: left;
height: 736px;
}
.page-wrap {
width: 100%;
float: left;
padding: 0 !important;
}
.edit-wrap {
position: relative;
float: left;
width: 348px;
height: 736px;
}
.drag-sty {
border: 1px solid #e6e6e6;
width: 100px;
padding: 6px;
font-size: 12px;
height: 30px;
display: inline-block;
line-height: 18px;
}
.iconfont-back {
background: #ccc;
border-radius: 2px;
padding: 0 2px;
float: left;
height: 18px;
margin-right: 6px;
}
.drag-sty:hover .iconfont {
color: #2875e8;
}
.iconfont {
color: #a8a7a7;
font-size: 18px;
}
.bg-purple {
background: #d3dce6;
}
.bg-purple-light {
background: #fafafa;
}
.left-shadow {}
.grid-content {
border-radius: 4px;
overflow: auto;
padding: 20px;
}
.tab-content {
border: 1px solid #eee;
border-radius: 4px;
min-height: 736px;
height: 100%;
overflow: auto;
}
.item {
height: 60px;
border: 0px solid #333;
display: inline-block;
/*border-radius: 4px;
border-style: dashed;*/
padding: 10px;
margin-bottom: 5px;
cursor: pointer;
}
.el-upload {
width: 100%;
}
.el-upload-dragger {
width: 100%;
}
.item2 {
height: 80px;
border: 0px solid #333;
padding: 10px;
margin-bottom: 5px;
cursor: pointer;
}
#removeBox {
height: 100px;
width: 100px;
border: 2px dashed #999;
background: rgba(0, 0, 0, 0.3);
position: absolute;
bottom: 10px;
right: 20px;
background: url(/static/image/deleteBox.png) no-repeat;
background-size: 90%;
background-position: center center;
}
.flxed {
position: relative;
top: 0;
left: 0;
}
.edit-content .el-form-item {
margin-bottom: 0;
}
.vali-el-input {
margin: 0 10px;
}
.el-checkbox {
margin: 4px 0;
}
.el-divider--horizontal {
margin: 4px 0;
}
h4,
h5 {
margin: 10px 0;
}
.submit-btn {
float: right;
}
/deep/ .page-item-group {
cursor: pointer;
position: relative;
.control-btn {
right: 0;
}
}
* {
box-sizing: border-box;
}
/deep/ .vali-el-input .el-input__inner {
height: 26px !important;
padding-right: 0;
padding-left: 4px;
}
/deep/ .sel-options .el-form-item__label {
width: 100%;
text-align: left;
}
/deep/ .el-icon-delete {
cursor: pointer;
}
/deep/ .edit-content .el-input {
width: 100%;
}
/deep/ .edit-content .el-date-editor.el-input,
/deep/ .edit-content .el-date-editor.el-input__inner {
width: 220px;
}
/deep/ .edit-content .el-input__inner {
height: 30px;
box-sizing: border-box;
}
/deep/ .edit-content .el-date-editor--date .el-icon-date,
/deep/ .edit-content .time-select .el-icon-circle-close {
line-height: 30px;
}
/deep/ .edit-content .el-form-item__label {
height: 30px;
}
/deep/ .long_input {
margin-left: 80px !important;
position: relative;
}
/deep/ .long_input_label {
width: 80px;
}
/deep/ .page-item:hover {
background: #e0f2ff;
}
/deep/ .page-item-select {
border: 1px dashed #4db8ff;
background: #e0f2ff;
}
/deep/ .page-item-select .control-btn {
display: block;
}
/deep/ .control-btn {
position: absolute;
top: 50%;
right: -20px;
transform: translate(0, -50%);
display: none;
}
/deep/ .control-btn .control-delete {
position: absolute;
right: 0;
bottom: -26px;
}
/deep/ .control-btn .control-arrow-wrap {
height: 20px;
cursor: pointer;
line-height: 20px;
background: #fff;
display: block;
}
/deep/ .control-btn .control-arrow-down {
bottom: -2px;
}
/deep/ .control-btn .control-arrow-up {
top: 28px;
margin-bottom: 6px;
}
/deep/ .tab-content .page-item {
margin-bottom: 0;
padding: 16px 20px;
min-height: 90px;
}
.sel-sty {
width: 200px;
margin-top: 14px;
display: block;
// -webkit-appearance: none;
background-color: #fff;
background-image: none;
border-radius: 4px;
border: 1px solid #dcdfe6;
box-sizing: border-box;
color: #606266;
font-size: inherit;
height: 30px;
line-height: 40px;
outline: none;
padding: 0 15px;
transition: border-color 0.2s cubic-bezier(0.645, 0.045, 0.355, 1);
}
</style>
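The designer draws cells on a scaled-down view of the image, so `generateAttributeData` converts every coordinate back to source pixels with `sourceSize * displayed / displayedSize`. The results are generally fractional, while the back end crops with integer rectangles. The sketch below shows one way the receiving side might round and clamp such values before cropping; `DesignScale` and its methods are hypothetical names, and whether the article's back end does anything like this is not shown.

```java
/** Hypothetical helper for mapping design-view coordinates to source pixels. */
final class DesignScale {

    /**
     * Maps a length or position measured on the displayed image to source
     * pixels, using the same ratio as the page's scaleSourceCoordinate,
     * rounded to the nearest integer.
     */
    static int toSource(double displayed, double displayedSize, int sourceSize) {
        return (int) Math.round(displayed * sourceSize / displayedSize);
    }

    /**
     * Clamps the span [start, start + length) to [0, limit) and returns
     * {start, length}, so a cropped rectangle never leaves the image.
     */
    static int[] clampSpan(int start, int length, int limit) {
        int s = Math.max(0, Math.min(start, limit));
        int e = Math.max(s, Math.min(start + length, limit));
        return new int[] { s, e - s };
    }
}
```

For example, a box 100 px wide drawn on an 800 px wide view of a 2480 px wide scan maps to `toSource(100, 800, 2480)`, which is 310 source pixels. Clamping matters because a rectangle that extends past the image edge would make a crop such as `Mat.submat` fail.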
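Note that this page uses the vue-draggable-resizable component. I won't cover how to add it to a Vue project. The running page looks like the figure below:
After a table image is uploaded, it looks like the figure below:
The red text was not suitable for display, so I blanked it out.
Once the design is finished, you can view the resulting design data. OpenCV can use this data for complete recognition; the code is as follows: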
List<Row> recognizeFromSettings(File imagePath) {
    Mat src = Imgcodecs.imread(imagePath.getAbsolutePath());
    String extention = Utils.getUriExtention(imagePath.getPath());
    // 1. Build the row buckets from the saved design data
    List<MatWithProperty> cellProperty = JSON.parseArray(this.settings.getString("cells"), MatWithProperty.class);
    List<Row> horBuckets = new ArrayList<>();
    createRowBuckets(cellProperty, horBuckets);
    // Walk the row buckets
    for (Row row : horBuckets) {
        RowBucket bucket = (RowBucket) row;
        List<MatWithProperty> rowMats = bucket.elements;
        // Build the column buckets for this row
        List<ColBucket> verBuckets = new ArrayList<>();
        createColBuckets(rowMats, verBuckets);
        for (ColBucket verticalBucket : verBuckets) {
            List<MatWithProperty> colMats = verticalBucket.elements;
            // Walk the cells of this column
            for (MatWithProperty mat : colMats) {
                // Crop the cell out of the source image
                Mat subMat = src.submat(new Rect(mat.rect.x, mat.rect.y, mat.rect.width, mat.rect.height)).clone();
                try {
                    // Run OCR on the cell image
                    BufferedImage image = OpenCVUtils.convertMat2BufferedImage(subMat, extention);
                    String content = tesseract.doOCR(image);
                    row.addCell(content);
                } catch (Exception e) {
                    logger.error("[OCR Failed]", e);
                } finally {
                    // Release the native memory held by the cropped Mat
                    subMat.release();
                }
            }
        }
    }
    return horBuckets;
}
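The method returns rows of recognized cell text, and the last step of the pipeline is turning that into JSON. In a real project the JSON library the code already depends on would normally do this. To keep the example self-contained, the sketch below uses only the standard library and serializes rows of strings as a JSON array of arrays. `TableJson` is a hypothetical class, and trimming each cell is an assumption based on OCR output usually carrying trailing whitespace.

```java
import java.util.List;

/** Hypothetical, dependency-free serializer for rows of OCR text. */
final class TableJson {

    /** Escapes a string for use inside a JSON string literal. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char ch : s.toCharArray()) {
            switch (ch) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (ch < 0x20) {
                        // Remaining control characters use the \\uXXXX form
                        sb.append(String.format("\\u%04x", (int) ch));
                    } else {
                        sb.append(ch);
                    }
            }
        }
        return sb.toString();
    }

    /** Serializes rows of cell text as a JSON array of arrays, trimming each cell. */
    static String toJson(List<List<String>> rows) {
        StringBuilder sb = new StringBuilder("[");
        for (int r = 0; r < rows.size(); r++) {
            if (r > 0) {
                sb.append(',');
            }
            sb.append('[');
            List<String> cells = rows.get(r);
            for (int c = 0; c < cells.size(); c++) {
                if (c > 0) {
                    sb.append(',');
                }
                sb.append('"').append(escape(cells.get(c).trim())).append('"');
            }
            sb.append(']');
        }
        return sb.append(']').toString();
    }
}
```

An array of arrays preserves the row and column order directly. If the first row holds headers, mapping each later row to an object keyed by header is a straightforward variation.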