pdfbox-users mailing list archives

From Tim Allison <talli...@apache.org>
Subject Re: Corrupted PDF file causing severe OOM
Date Wed, 15 May 2019 13:54:09 GMT
Sounds like it might be a bug.

PDFBox colleagues, any recs?

Slava, if you’re able to share the file even if only privately, that’ll
help.

On Wed, May 15, 2019 at 9:49 AM Slava G <slavago@gmail.com> wrote:

> I have a small PDF file (142 KB). When I try to parse it with Tika, my
> entire app crashes with an OOM and a 36 GB heap dump (nothing else is in
> the code, just parsing this PDF).
> A possibly related error: FlateFilter: stop reading corrupt stream due to a
> DataFormatException
> And the stack trace (at the moment of the OOM):
> "main" #1 prio=5 os_prio=0 tid=0x00007f6460009000 nid=0x4876 waiting for monitor entry [0x00007f646680d000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at java.util.HashMap.newNode(HashMap.java:1734)
>         at java.util.HashMap.putVal(HashMap.java:630)
>         at java.util.HashMap.put(HashMap.java:611)
>         at org.apache.fontbox.cmap.CMap.addCharMapping(CMap.java:191)
>         at org.apache.fontbox.cmap.CMapParser.parseBeginbfrange(CMapParser.java:398)
>         at org.apache.fontbox.cmap.CMapParser.parse(CMapParser.java:136)
>         at org.apache.pdfbox.pdmodel.font.CMapManager.parseCMap(CMapManager.java:75)
>         at org.apache.pdfbox.pdmodel.font.PDFont.readCMap(PDFont.java:197)
>         at org.apache.pdfbox.pdmodel.font.PDFont.<init>(PDFont.java:137)
>         at org.apache.pdfbox.pdmodel.font.PDType0Font.<init>(PDType0Font.java:176)
>         at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:83)
>         at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
>         at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
>         at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
>         at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
>         at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
>         at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
>         at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
>         at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
>         at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
>
>
> Please advise how I can detect that this can happen and skip such files
> during parsing. Or is this a bug?
>
> Thanks
>
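
On detecting and skipping such files: there may be no reliable in-process check, since an OutOfMemoryError during parsing can take the whole JVM down. One common workaround is to parse untrusted files in a child JVM with a capped heap and a timeout, and to skip any file whose child process fails. Tika provides ForkParser and tika-server for this purpose; the sketch below only illustrates the general idea with plain ProcessBuilder. The names IsolatedParse and runIsolated are hypothetical and not part of Tika or PDFBox.

```java
import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration only; not a Tika or PDFBox API.
class IsolatedParse {

    /**
     * Runs the command in a child process. Returns true only if the child
     * exits with status 0 within the timeout; a crash (for example an
     * OutOfMemoryError in the child JVM) or a hang returns false, so the
     * caller can skip the file instead of dying with it.
     */
    static boolean runIsolated(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        // Send the child's output to a temp file so a full pipe buffer
        // can never block the child.
        File log = File.createTempFile("isolated-parse", ".log");
        try {
            Process child = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    .redirectOutput(log)
                    .start();
            if (!child.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                // Timed out: kill the child and report failure.
                child.destroyForcibly();
                child.waitFor();
                return false;
            }
            return child.exitValue() == 0;
        } finally {
            log.delete();
        }
    }
}
```

A caller might pass a command such as `java -Xmx512m -jar tika-app.jar --text suspect.pdf` (the jar path, file name, and heap size are placeholders) and treat a false result as "skip this file". The heap cap keeps a runaway parse from consuming the host's memory, at the cost of one JVM startup per file.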
