pdfbox-users mailing list archives

From Doug Sackin <dsac...@gmail.com>
Subject OutOfMemoryError from FlateFilter (could be PDFBOX-453 again)
Date Mon, 01 Apr 2013 18:13:25 GMT
I appear to have something similar to the bug identified and fixed in
PDFBOX-453 - FlateFilter.decode() throwing OutOfMemoryError.

I'm doing text extraction through Twister Data Framework using Tika 1.2
which calls PDFBox. I have PDFBox 1.7. My OS is Scientific Linux 5.8. Java
is JDK 1.6.0_37.

The offending exception is below:

Caused by: java.lang.OutOfMemoryError
    at java.util.zip.Inflater.inflateBytes(Native Method)
    at java.util.zip.Inflater.inflate(Inflater.java:238)
    at java.util.zip.Inflater.inflate(Inflater.java:256)
    at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:169)
    at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:98)
    at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:279)
    at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:221)
    at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:156)
    at org.apache.pdfbox.pdmodel.common.COSStreamArray.getUnfilteredStream(COSStreamArray.java:196)
    at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:108)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:253)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:237)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:217)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:448)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:372)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:328)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:66)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:153)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)

Before that, the log shows a long string of exceptions from PDFBox's
attempts on other PDF files, interspersed with "FlateFilter: stop reading
corrupt stream due to a DataFormatException" messages. These are in the
attached log file.

The other exceptions are IndexOutOfBounds, ClassCastException,
NegativeArraySizeException, NullPointerException, IOException (regarding
font(COSName{F2}) in map{}), and IllegalArgumentException. These may or may
not be related (the exceptions appear on different files), but I wonder
if they corrupted the stream state enough that PDFBox attempted to inflate
corrupt data.
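
To illustrate the kind of guard that could stop a corrupt Flate stream from
running away, here is a minimal sketch using only java.util.zip. It is not
PDFBox's FlateFilter code; the class name BoundedInflate, the method names,
and the maxBytes limit are all hypothetical. It caps the decompressed size
and turns DataFormatException or a stalled inflater into an IOException.
Whether a cap like this would avoid the native OutOfMemoryError in the trace
above is an assumption, since that error is raised inside
Inflater.inflateBytes itself.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper, not part of PDFBox or Tika.
class BoundedInflate {

    // Inflate zlib-wrapped data, refusing to produce more than maxBytes.
    static byte[] inflate(byte[] compressed, int maxBytes) throws IOException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                // No progress and not finished: input is truncated or
                // requires a preset dictionary. Stop instead of spinning.
                if (n == 0 && !inflater.finished()) {
                    throw new IOException("Flate stream is truncated or unsupported");
                }
                if (out.size() + n > maxBytes) {
                    throw new IOException("Decompressed data exceeds " + maxBytes + " bytes");
                }
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IOException("Corrupt Flate stream: " + e.getMessage());
        } finally {
            inflater.end(); // release native zlib memory promptly
        }
    }

    // Compress data so the sketch can be exercised without a PDF file.
    static byte[] deflate(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] text = "BT /F2 12 Tf (Hello) Tj ET".getBytes(Charset.forName("UTF-8"));
        byte[] packed = deflate(text);
        System.out.println(inflate(packed, 1024).length == text.length);
        try {
            inflate(new byte[] {1, 2, 3, 4}, 1024); // garbage input
        } catch (IOException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Calling inflater.end() in the finally block matters here because the
Inflater holds native memory that is otherwise released only when the
object is finalized.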

If it is the same issue, it was reported to be fixed in 0.8. If it is a new
issue, is it possible to fix it? I cannot provide any of the source PDF
files (client data), but I am attaching the log output containing all of
the exception traces including the final OutOfMemoryError.

Thanks for any insights.

Doug
