pdfbox-users mailing list archives

From Slava G <slav...@gmail.com>
Subject Re: Corrupted PDF file causing severe OOM
Date Wed, 15 May 2019 14:35:17 GMT
Will definitely try. Is this RC available via Maven?

On Wed, May 15, 2019, 17:20 Tim Allison <tallison@apache.org> wrote:

> Yay! Tilman and colleagues on PDFBox really are _that_ fast. :)
>
>   You can try Tika’s integration with 2.0.15 in our 1.21-rc2:
>
> https://lists.apache.org/thread.html/2c027535156cc6862149490b289552d72ba5a9bff985fb7cce794e21@%3Cdev.tika.apache.org%3E
>
> On Wed, May 15, 2019 at 10:01 AM Slava G <slavago@gmail.com> wrote:
>
> > Sure, I can share it privately.
> > But it seems this is already fixed in PDFBox 2.0.15: running tika-app
> > (1.20) causes the same issue, but when I ran ExtractText in PDFBox
> > 2.0.15 I got the following:
> > May 15, 2019 4:59:11 PM org.apache.pdfbox.filter.FlateFilter decompress
> > WARNING: FlateFilter: premature end of stream due to a DataFormatException
> > May 15, 2019 4:59:11 PM org.apache.pdfbox.filter.FlateFilter decode
> > SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
> > Exception in thread "main" java.io.IOException: java.util.zip.DataFormatException: invalid literal/lengths set
> > at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:58)
> > at org.apache.pdfbox.filter.Filter.decode(Filter.java:87)
> > at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:77)
> > at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:175)
> > at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:163)
> > at org.apache.pdfbox.pdmodel.PDPage.getContents(PDPage.java:170)
> > at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:91)
> > at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:495)
> > at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:479)
> > at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:152)
> > at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
> > at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
> > at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
> > at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
> > at org.apache.pdfbox.tools.ExtractText.extractPages(ExtractText.java:375)
> > at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:272)
> > at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:96)
> > at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)
> > Caused by: java.util.zip.DataFormatException: invalid literal/lengths set
> > at java.util.zip.Inflater.inflateBytes(Native Method)
> > at java.util.zip.Inflater.inflate(Inflater.java:259)
> > at java.util.zip.Inflater.inflate(Inflater.java:280)
> > at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:83)
> > at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:50)
> > ... 17 more
> >
> >
> > On Wed, May 15, 2019 at 4:54 PM Tim Allison <tallison@apache.org> wrote:
> >
> > > Sounds like it might be a bug.
> > >
> > > PDFBox colleagues, any recs?
> > >
> > > Slava, if you’re able to share the file even if only privately, that’ll
> > > help.
> > >
> > > On Wed, May 15, 2019 at 9:49 AM Slava G <slavago@gmail.com> wrote:
> > >
> > > > I have a small PDF file (142 KB). When I try to parse it with Tika, my
> > > > entire app crashes with an OOM and a 36 GB heap dump (nothing else in
> > > > the code, just parsing this PDF).
> > > > Possibly related error: FlateFilter: stop reading corrupt stream due to a
> > > > DataFormatException
> > > > And the stack trace (at the moment of the OOM):
> > > > "main" #1 prio=5 os_prio=0 tid=0x00007f6460009000 nid=0x4876 waiting for monitor entry [0x00007f646680d000]
> > > >    java.lang.Thread.State: BLOCKED (on object monitor)
> > > >         at java.util.HashMap.newNode(HashMap.java:1734)
> > > >         at java.util.HashMap.putVal(HashMap.java:630)
> > > >         at java.util.HashMap.put(HashMap.java:611)
> > > >         at org.apache.fontbox.cmap.CMap.addCharMapping(CMap.java:191)
> > > >         at org.apache.fontbox.cmap.CMapParser.parseBeginbfrange(CMapParser.java:398)
> > > >         at org.apache.fontbox.cmap.CMapParser.parse(CMapParser.java:136)
> > > >         at org.apache.pdfbox.pdmodel.font.CMapManager.parseCMap(CMapManager.java:75)
> > > >         at org.apache.pdfbox.pdmodel.font.PDFont.readCMap(PDFont.java:197)
> > > >         at org.apache.pdfbox.pdmodel.font.PDFont.<init>(PDFont.java:137)
> > > >         at org.apache.pdfbox.pdmodel.font.PDType0Font.<init>(PDType0Font.java:176)
> > > >         at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:83)
> > > >         at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
> > > >         at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
> > > >         at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
> > > >         at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
> > > >         at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
> > > >         at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
> > > >         at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
> > > >         at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
> > > >         at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
> > > >         at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
> > > >         at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
> > > >         at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
> > > >         at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
> > > >
> > > >
> > > > Please advise how I can detect that this can happen and skip such a
> > > > file during parsing. Or is this a bug?
> > > >
> > > > Thanks
> > > >
> > >
> >
>
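
On the original question of how to skip a file like this rather than lose the whole application: one possible approach, not something proposed in the thread, is to run extraction in a child JVM with a capped heap and a timeout, so that an OOM only kills the child. The sketch below uses only the JDK and the `ExtractText` command of the PDFBox app jar seen in the trace above. The class name `IsolatedExtract`, the jar path, the heap limit and the timeout are all illustrative assumptions.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper (not part of PDFBox or Tika): runs ExtractText in a
// separate JVM so a runaway parse cannot exhaust the caller's heap.
class IsolatedExtract {

    // Builds the child-JVM command line. appJar and maxHeap are caller-supplied
    // assumptions, e.g. "pdfbox-app-2.0.15.jar" and "512m".
    static List<String> buildCommand(String appJar, String maxHeap,
                                     String pdf, String outTxt) {
        List<String> cmd = new ArrayList<>();
        cmd.add(System.getProperty("java.home")
                + File.separator + "bin" + File.separator + "java");
        cmd.add("-Xmx" + maxHeap);   // cap the child's heap
        cmd.add("-jar");
        cmd.add(appJar);
        cmd.add("ExtractText");      // PDFBox command-line tool
        cmd.add(pdf);
        cmd.add(outTxt);
        return cmd;
    }

    // Returns true only if the child exits with status 0 within the timeout.
    // A child that runs out of memory or hangs is treated as "skip this file".
    static boolean runIsolated(List<String> cmd, long timeoutSeconds)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(cmd)
                .redirectErrorStream(true)
                .redirectOutput(ProcessBuilder.Redirect.INHERIT)
                .start();
        if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            p.destroyForcibly();     // hung or too slow: kill and skip
            return false;
        }
        return p.exitValue() == 0;
    }

    public static void main(String[] args) throws Exception {
        // Usage: java IsolatedExtract <pdfbox-app.jar> <input.pdf> <output.txt>
        List<String> cmd = buildCommand(args[0], "512m", args[1], args[2]);
        boolean ok = runIsolated(cmd, 60);
        System.out.println(ok ? "extracted" : "skipped (failed, OOM or timeout)");
    }
}
```

This only contains the damage; it does not detect a bad CMap up front. Tika also offers process isolation of its own (ForkParser, tika-server), which may be a better fit when Tika is already in use.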
