pdfbox-users mailing list archives

From jl...@gi-bon.sk
Subject Re: problems with pdf parsing
Date Fri, 24 Aug 2012 11:44:24 GMT
Hi,

Thanks for the reply.

Your advice helped.



Best regards
Juraj Lonc


GI-BÓN, spol. s r.o.
Management Systems

Bratislavská 11
SK - 010 01 Žilina
Tel: +421-41-564 3437-8
Mobil: +421-907-815 147
Fax: +421-41-564 3439
e-mail: jlonc@gi-bon.sk
homepage: http://www.gi-bon.sk 





From:   Andreas Lehmkuehler <andreas@lehmi.de>
To:     users@pdfbox.apache.org, 
Date:   23. 08. 2012 18:21
Subject:        Re: problems with pdf parsing



Hi,

On 16.08.2012 16:11, jlonc@gi-bon.sk wrote:
> Hi,
>
> I'm trying to load some sample PDF documents, but only 1 of 4 is parsed by
> PDFBox without an exception.
> Adobe Reader opens all of those PDF documents without any sign of problems.
>
>
> public static void main(String[] args) throws Exception {
>     // sample document
>     InputStream ins = TestGetTexts.class.getResourceAsStream("/034352.pdf");
>
>     PDFParser parser = new PDFParser(ins);
>     parser.parse();
>     COSDocument cosDoc = parser.getDocument();
>     PDDocument pdDoc = new PDDocument(cosDoc);
> }
First of all, you should use one of the static load methods provided by
PDDocument:

    InputStream ins = TestGetTexts.class.getResourceAsStream("/034352.pdf");
    PDDocument pdDoc = PDDocument.load(ins);


> It throws exceptions at the line "parser.parse();".
> What is wrong with that?
Hard to say without having one of these PDFs at hand. Have you tried the new
non-sequential parser (use loadNonSeq instead of load)?

> 16.8.2012 15:49:49 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
> WARNING: Specified stream length 252 is wrong. Fall back to reading stream
> until 'endstream'.
> 16.8.2012 15:49:49 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
> WARNING: Specified stream length 34 is wrong. Fall back to reading stream
> until 'endstream'.
> 16.8.2012 15:49:49 org.apache.pdfbox.filter.FlateFilter decode
> SEVERE: FlateFilter: stop reading corrupt stream due to a
> DataFormatException
> 16.8.2012 15:49:49 org.apache.pdfbox.filter.FlateFilter decode
> SEVERE: FlateFilter: stop reading corrupt stream due to a
> DataFormatException
> 16.8.2012 15:49:49 org.apache.pdfbox.filter.FlateFilter decode
> SEVERE: FlateFilter: stop reading corrupt stream due to a
> DataFormatException
> 16.8.2012 15:49:49 org.apache.pdfbox.filter.FlateFilter decode
> SEVERE: FlateFilter: stop reading corrupt stream due to a
> DataFormatException
> 16.8.2012 15:49:49 org.apache.pdfbox.filter.FlateFilter decode
> SEVERE: FlateFilter: stop reading corrupt stream due to a
> DataFormatException
> 16.8.2012 15:49:49 org.apache.pdfbox.filter.FlateFilter decode
> SEVERE: FlateFilter: stop reading corrupt stream due to a
> DataFormatException
> 16.8.2012 15:49:49 org.apache.pdfbox.filter.FlateFilter decode
> SEVERE: FlateFilter: stop reading corrupt stream due to a
> DataFormatException
> 16.8.2012 15:49:49 org.apache.pdfbox.filter.FlateFilter decode
> SEVERE: FlateFilter: stop reading corrupt stream due to a
> DataFormatException
> 16.8.2012 15:49:49 org.apache.pdfbox.filter.FlateFilter decode
> SEVERE: FlateFilter: stop reading corrupt stream due to a
> DataFormatException
> 16.8.2012 15:49:49 org.apache.pdfbox.filter.FlateFilter decode
> SEVERE: FlateFilter: stop reading corrupt stream due to a
> DataFormatException
> Exception in thread "main" java.io.IOException
>          at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:138)
>          at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:301)
>          at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:221)
>          at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:156)
>          at org.apache.pdfbox.pdfparser.PDFXrefStreamParser.<init>(PDFXrefStreamParser.java:61)
>          at org.apache.pdfbox.pdfparser.PDFParser.parseXrefStream(PDFParser.java:846)
>          at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:574)
>          at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:187)
>          at test.TestGetTexts.main(TestGetTexts.java:20)
> Caused by: java.util.zip.DataFormatException: incorrect header check
>          at java.util.zip.Inflater.inflateBytes(Native Method)
>          at java.util.zip.Inflater.inflate(Inflater.java:238)
>          at java.util.zip.Inflater.inflate(Inflater.java:256)
>          at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:169)
>          at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:98)
>          ... 8 more
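
The "incorrect header check" cause in the trace above comes from java.util.zip.Inflater, which reports it when the bytes it is given do not start with a valid zlib header. A minimal sketch of that behaviour using only the JDK, not PDFBox; the class and method names are made up for illustration, and it does not show why the stream in the PDF is damaged:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class InflateHeaderCheck {

    /** Returns null if the bytes inflate cleanly, otherwise a short error description. */
    static String tryInflate(byte[] data) {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        byte[] out = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(out);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    return "truncated input"; // ran out of data before the stream ended
                }
            }
            return null;
        } catch (DataFormatException e) {
            return e.getMessage(); // the message that shows up as "Caused by" in a trace
        } finally {
            inflater.end();
        }
    }

    /** Compresses a small byte array into zlib format (enough for this demo). */
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        byte[] buf = new byte[256];
        int n = deflater.deflate(buf);
        deflater.end();
        return Arrays.copyOf(buf, n);
    }

    public static void main(String[] args) {
        byte[] good = deflate("hello".getBytes(StandardCharsets.US_ASCII));
        byte[] bad = "not zlib data".getBytes(StandardCharsets.US_ASCII);
        System.out.println("valid stream:   " + tryInflate(good));
        System.out.println("invalid stream: " + tryInflate(bad));
    }
}
```

If the parser picks up the wrong byte range for a stream (for example after a wrong stream length), the Flate decoder would see data like the second case.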
>
>
> the other pdf:
>
> 16.8.2012 16:08:44 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
> WARNING: Specified stream length 4192 is wrong. Fall back to reading
> stream until 'endstream'.
> 16.8.2012 16:08:45 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
> WARNING: Specified stream length 576 is wrong. Fall back to reading stream
> until 'endstream'.
> 16.8.2012 16:08:45 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
> WARNING: Specified stream length 432 is wrong. Fall back to reading stream
> until 'endstream'.
> 16.8.2012 16:08:45 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
> WARNING: Specified stream length 304 is wrong. Fall back to reading stream
> until 'endstream'.
> 16.8.2012 16:08:45 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
> WARNING: Specified stream length 480 is wrong. Fall back to reading stream
> until 'endstream'.
> 16.8.2012 16:08:45 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
> WARNING: Specified stream length 176 is wrong. Fall back to reading stream
> until 'endstream'.
> 16.8.2012 16:08:45 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
> WARNING: Specified stream length 2096 is wrong. Fall back to reading
> stream until 'endstream'.
> 16.8.2012 16:08:45 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
> WARNING: Specified stream length 137440 is wrong. Fall back to reading
> stream until 'endstream'.
> Exception in thread "main" org.apache.pdfbox.exceptions.WrappedIOException:
> Could not push back 137440 bytes in order to reparse stream. Try
> increasing push back buffer using system property
> org.apache.pdfbox.baseParser.pushBackSize
>          at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:546)
>          at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:566)
>          at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:187)
>          at test.TestGetTexts.main(TestGetTexts.java:20)
> Caused by: java.io.IOException: Push back buffer is full
>          at java.io.PushbackInputStream.unread(PushbackInputStream.java:215)
>          at org.apache.pdfbox.io.PushBackInputStream.unread(PushBackInputStream.java:144)
>          at org.apache.pdfbox.io.PushBackInputStream.unread(PushBackInputStream.java:133)
>          at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:542)
>          ... 3 more
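
The "Push back buffer is full" cause in the trace above is raised by java.io.PushbackInputStream when more bytes are unread than its buffer can hold. A small JDK-only sketch of that limit, assuming nothing about PDFBox internals beyond what the trace shows; the class name, method name and sizes are illustrative:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;

public class PushbackLimitDemo {

    /**
     * Reads `count` bytes from a stream whose push back buffer holds
     * `bufferSize` bytes, then tries to unread all of them again.
     * Returns null on success, otherwise the IOException message.
     */
    static String readThenUnread(int bufferSize, int count) {
        byte[] source = new byte[count];
        try (PushbackInputStream in =
                 new PushbackInputStream(new ByteArrayInputStream(source), bufferSize)) {
            byte[] chunk = new byte[count];
            int read = in.read(chunk, 0, count);
            in.unread(chunk, 0, read); // fails if read > bufferSize
            return null;
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Buffer large enough for everything that was read: unread succeeds.
        System.out.println("buffer 16, unread 16: " + readThenUnread(16, 16));
        // Buffer smaller than the amount to push back: unread throws.
        System.out.println("buffer 8,  unread 16: " + readThenUnread(8, 16));
    }
}
```

This matches the situation in the log: the parser read 137440 bytes, found the declared length to be wrong, and could not push all of them back in order to reparse.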
>
>
>
> or:
>
> 16.8.2012 16:10:27 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
> WARNING: Specified stream length 8 is wrong. Fall back to reading stream
> until 'endstream'.
> 16.8.2012 16:10:27 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
> WARNING: Specified stream length 77788 is wrong. Fall back to reading
> stream until 'endstream'.
> Exception in thread "main" org.apache.pdfbox.exceptions.WrappedIOException:
> Could not push back 77788 bytes in order to reparse stream. Try
> increasing push back buffer using system property
> org.apache.pdfbox.baseParser.pushBackSize
>          at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:546)
>          at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:566)
>          at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:187)
>          at test.TestGetTexts.main(TestGetTexts.java:21)
> Caused by: java.io.IOException: Push back buffer is full
>          at java.io.PushbackInputStream.unread(PushbackInputStream.java:215)
>          at org.apache.pdfbox.io.PushBackInputStream.unread(PushBackInputStream.java:144)
>          at org.apache.pdfbox.io.PushBackInputStream.unread(PushBackInputStream.java:133)
>          at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:542)
>          ... 3 more
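
The exception message itself suggests raising the system property org.apache.pdfbox.baseParser.pushBackSize. A minimal sketch of setting it from code; the value 200000 is an arbitrary choice larger than the 137440 and 77788 bytes reported above, and the helper name is hypothetical. It is an assumption here that the parser reads the property when its classes are first loaded, so the call would have to happen before any PDFBox parsing code runs:

```java
public class PushBackSizeProperty {

    // Property name taken verbatim from the exception message in the log.
    static final String PROP = "org.apache.pdfbox.baseParser.pushBackSize";

    /** Sets the property and returns the value as a later Integer.getInteger lookup sees it. */
    static int configure(int bytes) {
        System.setProperty(PROP, Integer.toString(bytes));
        return Integer.getInteger(PROP, -1);
    }

    public static void main(String[] args) {
        // Assumption: must run before the first PDFBox parser class is used.
        int effective = configure(200000);
        System.out.println(PROP + " = " + effective);
        // ... load the document here, e.g. with PDDocument.load(ins) ...
    }
}
```

The same value can be passed on the command line instead, as -Dorg.apache.pdfbox.baseParser.pushBackSize=200000, which avoids any ordering concern inside the program.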
>
> best regards
> Juraj Lonc


BR
Andreas Lehmkühler


