pdfbox-users mailing list archives

From jl...@gi-bon.sk
Subject problems with pdf parsing
Date Thu, 16 Aug 2012 14:11:08 GMT
Hi,

I'm trying to load some sample PDF documents, but only 1 of 4 is parsed by
PDFBox without an exception.
Adobe Reader opens all of these PDF documents without any sign of problems.


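import java.io.InputStream;

import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;

public class TestGetTexts {
    public static void main(String[] args) throws Exception {
        // Sample document, loaded from the root of the classpath.
        InputStream ins = TestGetTexts.class.getResourceAsStream("/034352.pdf");

        PDFParser parser = new PDFParser(ins);
        parser.parse(); // parse the whole file into a COS object tree
        COSDocument cosDoc = parser.getDocument();
        PDDocument pdDoc = new PDDocument(cosDoc);
        pdDoc.close(); // release the document's resources
    }
}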


It throws an exception at the line "parser.parse();".
What is wrong here?


16.8.2012 15:49:49 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 252 is wrong. Fall back to reading stream until 'endstream'.
16.8.2012 15:49:49 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 34 is wrong. Fall back to reading stream until 'endstream'.
16.8.2012 15:49:49 org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
16.8.2012 15:49:49 org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
16.8.2012 15:49:49 org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
16.8.2012 15:49:49 org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
16.8.2012 15:49:49 org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
16.8.2012 15:49:49 org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
16.8.2012 15:49:49 org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
16.8.2012 15:49:49 org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
16.8.2012 15:49:49 org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
16.8.2012 15:49:49 org.apache.pdfbox.filter.FlateFilter decode
SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
Exception in thread "main" java.io.IOException
        at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:138)
        at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:301)
        at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:221)
        at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:156)
        at org.apache.pdfbox.pdfparser.PDFXrefStreamParser.<init>(PDFXrefStreamParser.java:61)
        at org.apache.pdfbox.pdfparser.PDFParser.parseXrefStream(PDFParser.java:846)
        at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:574)
        at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:187)
        at test.TestGetTexts.main(TestGetTexts.java:20)
Caused by: java.util.zip.DataFormatException: incorrect header check
        at java.util.zip.Inflater.inflateBytes(Native Method)
        at java.util.zip.Inflater.inflate(Inflater.java:238)
        at java.util.zip.Inflater.inflate(Inflater.java:256)
        at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:169)
        at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:98)
        ... 8 more


Another PDF fails like this:

16.8.2012 16:08:44 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 4192 is wrong. Fall back to reading stream until 'endstream'.
16.8.2012 16:08:45 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 576 is wrong. Fall back to reading stream until 'endstream'.
16.8.2012 16:08:45 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 432 is wrong. Fall back to reading stream until 'endstream'.
16.8.2012 16:08:45 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 304 is wrong. Fall back to reading stream until 'endstream'.
16.8.2012 16:08:45 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 480 is wrong. Fall back to reading stream until 'endstream'.
16.8.2012 16:08:45 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 176 is wrong. Fall back to reading stream until 'endstream'.
16.8.2012 16:08:45 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 2096 is wrong. Fall back to reading stream until 'endstream'.
16.8.2012 16:08:45 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 137440 is wrong. Fall back to reading stream until 'endstream'.
Exception in thread "main" org.apache.pdfbox.exceptions.WrappedIOException: Could not push back 137440 bytes in order to reparse stream. Try increasing push back buffer using system property org.apache.pdfbox.baseParser.pushBackSize
        at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:546)
        at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:566)
        at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:187)
        at test.TestGetTexts.main(TestGetTexts.java:20)
Caused by: java.io.IOException: Push back buffer is full
        at java.io.PushbackInputStream.unread(PushbackInputStream.java:215)
        at org.apache.pdfbox.io.PushBackInputStream.unread(PushBackInputStream.java:144)
        at org.apache.pdfbox.io.PushBackInputStream.unread(PushBackInputStream.java:133)
        at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:542)
        ... 3 more



Or, with yet another one:

16.8.2012 16:10:27 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 8 is wrong. Fall back to reading stream until 'endstream'.
16.8.2012 16:10:27 org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 77788 is wrong. Fall back to reading stream until 'endstream'.
Exception in thread "main" org.apache.pdfbox.exceptions.WrappedIOException: Could not push back 77788 bytes in order to reparse stream. Try increasing push back buffer using system property org.apache.pdfbox.baseParser.pushBackSize
        at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:546)
        at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:566)
        at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:187)
        at test.TestGetTexts.main(TestGetTexts.java:21)
Caused by: java.io.IOException: Push back buffer is full
        at java.io.PushbackInputStream.unread(PushbackInputStream.java:215)
        at org.apache.pdfbox.io.PushBackInputStream.unread(PushBackInputStream.java:144)
        at org.apache.pdfbox.io.PushBackInputStream.unread(PushBackInputStream.java:133)
        at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:542)
        ... 3 more
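
The two "Could not push back" messages suggest raising the system property org.apache.pdfbox.baseParser.pushBackSize. Below is a minimal sketch of doing that from code, using only the JDK. The class name PushBackSizeSketch, the helper ensurePushBackSize, and the value 262144 are made up for illustration; the value is simply larger than the 137440 and 77788 bytes reported in the traces. It is also an assumption that PDFBox reads the property when the parser is created, so the sketch sets it before any PDFBox class is touched. Passing -Dorg.apache.pdfbox.baseParser.pushBackSize=262144 on the java command line should have the same effect.

```java
public class PushBackSizeSketch {
    // Property name copied verbatim from the exception message.
    static final String PUSH_BACK_PROPERTY =
            "org.apache.pdfbox.baseParser.pushBackSize";

    /**
     * Hypothetical helper: raises the property to requiredBytes if it is
     * unset or smaller, and returns the value that is in effect afterwards.
     */
    static int ensurePushBackSize(int requiredBytes) {
        int current = Integer.getInteger(PUSH_BACK_PROPERTY, 0);
        if (current < requiredBytes) {
            System.setProperty(PUSH_BACK_PROPERTY,
                    Integer.toString(requiredBytes));
        }
        return Integer.getInteger(PUSH_BACK_PROPERTY, 0);
    }

    public static void main(String[] args) {
        // Assumed value: anything above the largest reported stream length.
        int size = ensurePushBackSize(262144);
        System.out.println(PUSH_BACK_PROPERTY + "=" + size);
        // Create the PDFParser only after this point, so that it can see
        // the new value (assumption about when PDFBox reads the property).
    }
}
```

This only addresses the "Push back buffer is full" failures. It does not explain why the declared stream lengths are rejected in the first place, and it is unlikely to help with the FlateFilter DataFormatException from the first document.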





Best regards,
Juraj Lonc
