pdfbox-users mailing list archives

From 叶严杰 <huoyanyo...@gmail.com>
Subject Re: bug report for v1.6.0
Date Wed, 09 May 2012 18:15:36 GMT
URL for the PDF file:
http://www.aclweb.org/anthology-new/P/P02/P02-1046.pdf

On Thu, May 10, 2012 at 1:26 AM, 叶严杰 <huoyanyouli@gmail.com> wrote:

> I tried to extract text from a PDF with PDFBox using stripper.getText (see
> the code below).
> The PDF is attached as a file, and the error output is included below.
> Is there any way to work around this bug?
>
> Regards
>
> *Code*
>     public void read()
>     {
>         PDDocument document = null;
>         FileInputStream is = null;
>         try {
>             is = new FileInputStream(file);
>             PDFParser parser = new PDFParser(is);
>             parser.parse();
>             document = parser.getPDDocument();
>             PDFTextStripper stripper = new PDFTextStripper();
>             content = stripper.getText(document);
>         } catch (FileNotFoundException e) {
>             e.printStackTrace();
>         } catch (IOException e) {
>             e.printStackTrace();
>         } finally {
>             if (is != null) {
>                 try {
>                     is.close();
>                 } catch (IOException e) {
>                     e.printStackTrace();
>                 }
>             }
>             if (document != null) {
>                 try {
>                     document.close();
>                 } catch (IOException e) {
>                     e.printStackTrace();
>                 }
>             }
>         }
>     }
>
> *Bug Info*
> Exception in thread "main" java.lang.NumberFormatException: For input string: "dup"
>     at java.lang.NumberFormatException.forInputString(Unknown Source)
>     at java.lang.Integer.parseInt(Unknown Source)
>     at java.lang.Integer.parseInt(Unknown Source)
>     at org.apache.pdfbox.pdmodel.font.PDType1Font.getEncodingFromFont(PDType1Font.java:344)
>     at org.apache.pdfbox.pdmodel.font.PDType1Font.determineEncoding(PDType1Font.java:280)
>     at org.apache.pdfbox.pdmodel.font.PDFont.<init>(PDFont.java:181)
>     at org.apache.pdfbox.pdmodel.font.PDSimpleFont.<init>(PDSimpleFont.java:83)
>     at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:152)
>     at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:108)
>     at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
>     at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:115)
>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:243)
>     at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
>     at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
>     at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
>     at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
>     at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:242)
>     at get.read(get.java:33)
>     at get.main(get.java:60)
>
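
The trace shows Integer.parseInt being handed the token "dup" while PDType1Font.getEncodingFromFont reads the embedded font's encoding. Type 1 font programs usually list encoding entries in the form "dup 65 /A put", so one plausible cause is an entry that does not have that shape. The following is only a minimal sketch of tolerant parsing under that assumption. It is not PDFBox's code: the class name Type1EncodingLine, the method parseCode, and the malformed sample line are all hypothetical, and only the JDK is used.

```java
import java.util.OptionalInt;

// Hypothetical helper, not part of PDFBox: illustrates tolerant parsing
// of Type 1 encoding entries of the form "dup <code> /<name> put".
public class Type1EncodingLine {

    /**
     * Returns the character code from an entry such as "dup 65 /A put",
     * or an empty OptionalInt when the line does not have that shape.
     */
    public static OptionalInt parseCode(String line) {
        String[] tokens = line.trim().split("\\s+");
        // Require exactly the shape: dup <code> /<name> put ...
        if (tokens.length < 4
                || !tokens[0].equals("dup")
                || !tokens[2].startsWith("/")
                || !tokens[3].equals("put")) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(tokens[1]));
        } catch (NumberFormatException e) {
            // A non-numeric code (for example a second "dup") is skipped
            // instead of aborting the whole text extraction.
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        // Well-formed entry.
        System.out.println(parseCode("dup 65 /A put"));
        // Assumed malformed entry, made up for illustration.
        System.out.println(parseCode("dup dup 161 10 getinterval"));
    }
}
```

With this approach a single unexpected line costs at most one encoding entry rather than failing the whole getText call. Whether that trade-off is acceptable depends on how much the missing entry affects the extracted text.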
