Reading PDF File Contents in Java
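To read the contents of a PDF file in Java, we can rely on third-party software. A common choice is xpdf. This article briefly describes how to install xpdf on Linux and how to use it from Java to read the contents of a PDF file.
I. Installing xpdf
On the Fedora Core (fc) series, no manual installation is needed, because xpdf can be installed directly with yum. Even so, I recommend downloading and installing it yourself. I once ran into a case where xpdf on a customer's server had been installed via yum and certain unusual PDF files could not be previewed. After uninstalling the yum package and reinstalling from the downloaded xpdf package, those files worked.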
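1. Download
We need three xpdf packages:
Main program: ftp://ftp.foolabs.com/pub/xpdf/xpdf-3.01pl2-linux.tar.gz
Simplified Chinese support: ftp://ftp.foolabs.com/pub/xpdf/xpdf-chinese-simplified.tar.gz
Traditional Chinese support: ftp://ftp.foolabs.com/pub/xpdf/xpdf-chinese-traditional.tar.gz
2. Installation and deployment
(1) Go to the download directory and extract the main program into /usr (or any other location, depending on your setup).
# tar zvfx xpdf-3.01pl2-linux.tar.gz -C /usr
# cd /usr
Then rename the directory to something simpler:
# mv xpdf-3.01pl2-linux xpdf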
(2) Set up Chinese support. Go back to the download directory and run the following in order:
# tar zvfx xpdf-chinese-simplified.tar.gz -C /usr/xpdf
# mv /usr/xpdf/xpdf-chinese-simplified /usr/xpdf/chinese-simplified
# tar zvfx xpdf-chinese-traditional.tar.gz -C /usr/xpdf
# mv /usr/xpdf/xpdf-chinese-traditional /usr/xpdf/chinese-traditional
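(3) Configure the environment
# vi /etc/bashrc
Add the following line:
export PATH=/usr/xpdf/:$PATH
After rebooting the machine, check that typing xpdf at the console no longer reports that the command or file cannot be found.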
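The same check can be done from Java before any PDF is processed. The following is only a minimal sketch: the class name XpdfPathCheck and the helper isOnPath are made up for illustration, and it assumes the downloaded package places both xpdf and pdftotext in the directory added to PATH above.

```java
import java.io.File;

class XpdfPathCheck {
    /** Returns true if an executable file called name exists in a directory listed in pathValue. */
    static boolean isOnPath(String name, String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return false;
        }
        // PATH entries are separated by ':' on Linux (File.pathSeparator).
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            File candidate = new File(dir, name);
            if (candidate.isFile() && candidate.canExecute()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String path = System.getenv("PATH");
        for (String tool : new String[] {"xpdf", "pdftotext"}) {
            String status = isOnPath(tool, path) ? "found" : "NOT found";
            System.out.println(tool + ": " + status + " on PATH");
        }
    }
}
```

If a tool is reported as not found, the PATH line in /etc/bashrc probably has not taken effect for the current session yet.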
(4) Configure resources
# cd /usr/xpdf
# cp sample-xpdfrc xpdfrc
# vi xpdfrc
Add the following at the top of the file (replace /usr/xpdf with the actual xpdf path):
#----- begin Chinese Simplified support package (2004-jul-27)
cidToUnicode Adobe-GB1 "/usr/xpdf/chinese-simplified/Adobe-GB1.cidToUnicode"
unicodeMap ISO-2022-CN "/usr/xpdf/chinese-simplified/ISO-2022-CN.unicodeMap"
unicodeMap EUC-CN "/usr/xpdf/chinese-simplified/EUC-CN.unicodeMap"
unicodeMap GBK "/usr/xpdf/chinese-simplified/GBK.unicodeMap"
cMapDir Adobe-GB1 "/usr/xpdf/chinese-simplified/CMap"
toUnicodeDir "/usr/xpdf/chinese-simplified/CMap"
#displayCIDFontTT Adobe-GB1 /usr/..../gkai00mp.ttf
#----- end Chinese Simplified support package
#----- begin Chinese Traditional support package (2004-jul-27)
cidToUnicode Adobe-CNS1 "/usr/xpdf/chinese-traditional/Adobe-CNS1.cidToUnicode"
unicodeMap Big5 "/usr/xpdf/chinese-traditional/Big5.unicodeMap"
unicodeMap Big5ascii "/usr/xpdf/chinese-traditional/Big5ascii.unicodeMap"
cMapDir Adobe-CNS1 "/usr/xpdf/chinese-traditional/CMap"
toUnicodeDir "/usr/xpdf/chinese-traditional/CMap"
#displayCIDFontTT Adobe-CNS1 /usr/..../bkai00mp.ttf
#----- end Chinese Traditional support package
Then run:
# cp xpdfrc /usr/local/etc/
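Since the paths in the xpdfrc entries have to match the actual install location, it can help to generate them instead of editing them by hand. The sketch below is only an illustration: the class XpdfrcSnippet and its method are hypothetical names, it covers just the Simplified Chinese entries listed above, and it assumes the support package was moved to a chinese-simplified directory under the install root as in step (2).

```java
import java.util.ArrayList;
import java.util.List;

class XpdfrcSnippet {
    /** Builds the Simplified Chinese xpdfrc entries for an xpdf installed under root. */
    static List<String> simplifiedChineseLines(String root) {
        // Drop a trailing slash so the generated paths contain no "//".
        String base = root.endsWith("/") ? root.substring(0, root.length() - 1) : root;
        String dir = base + "/chinese-simplified";
        List<String> lines = new ArrayList<>();
        lines.add("cidToUnicode Adobe-GB1 \"" + dir + "/Adobe-GB1.cidToUnicode\"");
        lines.add("unicodeMap ISO-2022-CN \"" + dir + "/ISO-2022-CN.unicodeMap\"");
        lines.add("unicodeMap EUC-CN \"" + dir + "/EUC-CN.unicodeMap\"");
        lines.add("unicodeMap GBK \"" + dir + "/GBK.unicodeMap\"");
        lines.add("cMapDir Adobe-GB1 \"" + dir + "/CMap\"");
        lines.add("toUnicodeDir \"" + dir + "/CMap\"");
        return lines;
    }

    public static void main(String[] args) {
        // The install root defaults to /usr/xpdf, the location used in this article.
        String root = args.length > 0 ? args[0] : "/usr/xpdf";
        for (String line : simplifiedChineseLines(root)) {
            System.out.println(line);
        }
    }
}
```

The printed lines can be pasted at the top of xpdfrc. The Traditional Chinese entries follow the same pattern with Adobe-CNS1 and the chinese-traditional directory.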
That completes the installation. Next, we look at how to use xpdf to read the contents of a PDF file.
II. Reading PDF file contents with xpdf
The approach is simple: run xpdf's pdftotext command through the familiar Runtime.getRuntime(), as follows:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

/**
 * Extracts the text of a PDF file by running pdftotext.
 *
 * @param filePath path to the PDF file
 * @return the extracted text with lines joined by spaces, or an empty string on failure
 */
public String getPdfContent(String filePath) {
    // -enc UTF-8: output encoding; -q: suppress messages; "-": write the text to stdout
    String[] cmd = {"pdftotext", "-enc", "UTF-8", "-q", filePath, "-"};
    StringBuilder sb = new StringBuilder();
    try {
        Process p = Runtime.getRuntime().exec(cmd);
        // Read stdout as UTF-8; try-with-resources closes the stream when done
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append(' ');
            }
        }
        // Wait for pdftotext to exit so no zombie process is left behind
        p.waitFor();
    } catch (IOException e) {
        // Covers both a failed launch (e.g. pdftotext not on PATH) and read errors
        e.printStackTrace();
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
    return sb.toString();
}
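As a variation on the method above, the same pdftotext call can be made through ProcessBuilder, which makes it easier to check the exit code and to keep the original line breaks. This is only a sketch under a few assumptions: the class PdfTextExtractor and its methods are illustrative names, a nonzero exit code is treated as failure, and pdftotext is expected to be on PATH.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

class PdfTextExtractor {
    /** Builds the same pdftotext command line used in the article. */
    static List<String> buildCommand(String filePath) {
        return Arrays.asList("pdftotext", "-enc", "UTF-8", "-q", filePath, "-");
    }

    /** Runs pdftotext and returns its stdout, line breaks included. */
    static String extract(String filePath) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(filePath));
        // Send stderr to this JVM's stderr so an unread pipe cannot stall the child.
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process p = pb.start();

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = p.getInputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        int exit = p.waitFor();
        if (exit != 0) {
            throw new IOException("pdftotext exited with code " + exit);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java PdfTextExtractor <file.pdf>");
            return;
        }
        System.out.print(extract(args[0]));
    }
}
```

Throwing an exception on failure, rather than returning an empty string, lets the caller tell an empty PDF apart from a failed conversion.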