pdfbox-users mailing list archives

From Doug Sackin <dsac...@gmail.com>
Subject Re: OutOfMemoryError from FlateFilter (could be PDFBOX-453 again)
Date Tue, 09 Apr 2013 14:40:04 GMT
Per Maruan's suggestion, I tracked down the bad file and ran the
ExtractText command line utility on it. I only had access to pdfbox-1.7.1
on the system. The file definitely appears to be corrupt. I cannot open it
with either Adobe or PDFBox ExtractText.

Using java -jar pdfbox-app-1.7.1.jar ExtractText bad_file.pdf, I get:

java.io.IOException: Error: Header doesn't contain versioninfo
    at org.apache.pdfbox.pdfparser.PDFParser.parseHeader(PDFParser.java:315)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:172)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1090)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1055)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:980)
    at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:211)
    at org.apache.pdfbox.ExtractText.main(ExtractText.java:84)
    at org.apache.pdfbox.PDFBox.main(ExtractText.java:42)


Using java -jar pdfbox-app-1.7.1.jar ExtractText -nonSeq bad_file.pdf, I
get:

java.io.IOException: Error: Missing end of file marker '%%EOF'
    at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.getStartxrefOffset(NonSequentialPDFParser.java:456)
    at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.initialParse(NonSequentialPDFParser.java:233)
    at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parse(NonSequentialPDFParser.java:574)
    at org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1124)
    at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:207)
    at org.apache.pdfbox.ExtractText.main(ExtractText.java:84)
    at org.apache.pdfbox.PDFBox.main(ExtractText.java:42)
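
Both errors point at the file's framing (no version header, no '%%EOF'
trailer), so a rough pre-check along these lines might help triage such files
before they reach the parser. This is only a sketch: the class and method
names (PdfSanityCheck, hasVersionHeader, hasEofMarker) and the 1024-byte tail
window are assumptions made for illustration, the checks are much looser than
what PDFParser and NonSequentialPDFParser actually do, and it needs Java 7 or
later for java.nio.file.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of PDFBox.
final class PdfSanityCheck {

    /** True if the data starts with something like "%PDF-1.4". */
    static boolean hasVersionHeader(byte[] data) {
        int len = Math.min(data.length, 16);
        String head = new String(data, 0, len, StandardCharsets.ISO_8859_1);
        // (?s) lets '.' match the line break that follows the header.
        return head.matches("(?s)%PDF-\\d\\.\\d.*");
    }

    /** True if "%%EOF" appears in the last 1024 bytes (assumed window). */
    static boolean hasEofMarker(byte[] data) {
        int start = Math.max(0, data.length - 1024);
        String tail = new String(data, start, data.length - start,
                StandardCharsets.ISO_8859_1);
        return tail.contains("%%EOF");
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: PdfSanityCheck <file.pdf>");
            return;
        }
        // Reads the whole file into memory; fine for a quick triage tool.
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("version header: " + hasVersionHeader(data));
        System.out.println("EOF marker:     " + hasEofMarker(data));
    }
}
```

A file that fails either check could be set aside without ever being handed
to the text extraction step.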

Now I will try a small app using the utility classes from the stack trace
to see if the same exception shows up.
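
On the earlier question of trapping the failure: one option in such an app is
to inflate into a capped buffer and turn DataFormatException into an
IOException. The sketch below uses only java.util.zip; BoundedInflate and the
maxOutputBytes parameter are hypothetical names, and this is not how
FlateFilter itself is written. It also only limits how much output is
accepted, so it would not necessarily prevent an OutOfMemoryError raised
inside the native inflate call itself.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper, not part of PDFBox.
final class BoundedInflate {

    /** Inflates zlib data, failing once output would exceed maxOutputBytes. */
    static byte[] inflate(byte[] compressed, int maxOutputBytes) throws IOException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                // No progress and not finished: input ran out or needs a dictionary.
                if (n == 0 && !inflater.finished()) {
                    throw new IOException("Truncated or unsupported deflate stream");
                }
                if ((long) out.size() + n > maxOutputBytes) {
                    throw new IOException("Inflated data exceeds " + maxOutputBytes + " bytes");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IOException("Corrupt deflate stream", e);
        } finally {
            inflater.end(); // release native memory promptly
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Build a small, highly compressible test stream.
        Deflater deflater = new Deflater();
        deflater.setInput(new byte[10000]);
        deflater.finish();
        byte[] packed = new byte[1024];
        int len = deflater.deflate(packed);
        deflater.end();
        byte[] compressed = Arrays.copyOf(packed, len);

        System.out.println("inflated bytes: " + inflate(compressed, 20000).length);
        try {
            inflate(compressed, 100); // cap smaller than the real output
        } catch (IOException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```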

Thank you for the tip.

Doug


On Tue, Apr 9, 2013 at 8:52 AM, Doug Sackin <dsackin@gmail.com> wrote:

> Has anyone else encountered recent problems with FlateFilter and
> OutOfMemory errors? Is there any way to trap it before it results in an
> OutOfMemoryError?
>
> Thanks
>
> Doug
>
>
> On Mon, Apr 1, 2013 at 2:13 PM, Doug Sackin <dsackin@gmail.com> wrote:
>
>> I appear to have something similar to the bug identified and fixed in
>> PDFBOX-453 - FlateFilter.decode() throwing OutOfMemoryError.
>>
>> I'm doing text extraction through Twister Data Framework using Tika 1.2
>> which calls PDFBox. I have PDFBox 1.7. My OS is Scientific Linux 5.8. Java
>> is JDK 1.6.0_37.
>>
>> The offending exception is below:
>>
>> Caused by: java.lang.OutOfMemoryError
>>     at java.util.zip.Inflater.inflateBytes(Native Method)
>>     at java.util.zip.Inflater.inflate(Inflater.java:238)
>>     at java.util.zip.Inflater.inflate(Inflater.java:256)
>>     at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:169)
>>     at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:98)
>>     at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:279)
>>     at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:221)
>>     at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:156)
>>     at org.apache.pdfbox.pdmodel.common.COSStreamArray.getUnfilteredStream(COSStreamArray.java:196)
>>     at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:108)
>>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:253)
>>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:237)
>>     at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:217)
>>     at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:448)
>>     at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:372)
>>     at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:328)
>>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:66)
>>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:153)
>>     at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
>>     at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
>>     at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
>>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>>
>> Before that, I have a long string of exceptions from PDFBox attempts on
>> PDF files, interspersed by "FlateFilter: stop reading corrupt stream due to
>> a DataFormatException". These are in the attached log file.
>>
>> The other exceptions are IndexOutOfBoundsException, ClassCastException,
>> NegativeArraySizeException, NullPointerException, IOException (regarding
>> font(COSName{F2}) in map{}), and IllegalArgumentException. These may or may
>> not be related (the exceptions appear on different files), but I wonder
>> if they corrupted the stream enough that PDFBox attempted to inflate
>> corrupt data.
>>
>> If it is the same issue, it was reported to be fixed in 0.8. If it is a
>> new issue, is it possible to fix it? I cannot provide any of the source PDF
>> files (client data), but I am attaching the log output containing all of
>> the exception traces including the final OutOfMemoryError.
>>
>> Thanks for any insights.
>>
>> Doug
>>
>>
>>
>>
>
