manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:838) error SPAM 10Go/hour
Date Tue, 29 May 2018 16:13:35 GMT
This is indeed a Tika bug, or a bug in the underlying PDFBox code it uses.

In order to make progress, we need a sample document that demonstrates the
problem.  Once we have that, I can open a Tika ticket.

Thanks,
Karl
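
A note on why the quoted trace below repeats the same five frames (processOperator -> DrawObject.process -> showTransparencyGroup -> processTransparencyGroup -> processStreamOperators): that pattern is what unbounded recursion looks like, and one plausible cause, not confirmed in this thread, is a form XObject whose content stream ends up drawing itself again. The following is a minimal sketch of that idea in plain Java. It is a toy model and NOT PDFBox code; the class name, the method names and the map-of-names structure are all hypothetical, chosen only for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;

// Toy model only: this is NOT PDFBox code. Each "form" is a name mapped to the
// names of the forms that its content stream draws (hypothetical structure).
final class FormCycleSketch {

    // Naive traversal: recurses once per nested draw. If the forms reference
    // each other in a cycle, it never terminates and ends in StackOverflowError.
    static int drawNaive(Map<String, List<String>> forms, String name) {
        int drawn = 1;
        for (String child : forms.getOrDefault(name, List.of())) {
            drawn += drawNaive(forms, child);
        }
        return drawn;
    }

    // Guarded traversal: tracks the forms currently being drawn and skips any
    // form that would re-enter itself, so a cycle is cut instead of followed.
    static int drawGuarded(Map<String, List<String>> forms, String name, Deque<String> active) {
        if (active.contains(name)) {
            return 0; // cycle detected: do not recurse again
        }
        active.push(name);
        int drawn = 1;
        for (String child : forms.getOrDefault(name, List.of())) {
            drawn += drawGuarded(forms, child, active);
        }
        active.pop();
        return drawn;
    }

    public static void main(String[] args) {
        // Form A draws form B, and form B draws form A again.
        Map<String, List<String>> forms = Map.of("A", List.of("B"), "B", List.of("A"));

        System.out.println(drawGuarded(forms, "A", new ArrayDeque<>())); // prints 2

        try {
            drawNaive(forms, "A");
        } catch (StackOverflowError e) {
            System.out.println("naive traversal overflowed the stack");
        }
    }
}
```

Whether the real documents contain such a cycle can only be settled with a sample file, which is why one is needed before a ticket can be opened.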


On Tue, May 29, 2018 at 12:06 PM msaunier <msaunier@citya.com> wrote:

> Hello Karl,
>
>
>
> PS: at the moment, I have 24 documents blocked: 20 with status
> « Processing » and 4 with status « About to Process ».
>
> So, I tested, and they are the same ones. I then imported the files and
> tested them locally with tika-app.jar, and I get this error for these files:
>
>
>
> WARN  Invalid XObject Subtype: null
> WARN  Invalid XObject Subtype: null
> WARN  Invalid XObject Subtype: null
> …
> WARN  Invalid XObject Subtype: null
> WARN  Invalid XObject Subtype: null
> WARN  Invalid XObject Subtype: null
> WARN  Invalid XObject Subtype: null
> Exception in thread "main" java.lang.StackOverflowError
>         at java.util.zip.Inflater.<init>(Inflater.java:102)
>         at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:99)
>         at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:74)
>         at org.apache.pdfbox.filter.Filter.decode(Filter.java:87)
>         at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:77)
>         at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:175)
>         at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:163)
>         at org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject.getContents(PDFormXObject.java:144)
>         at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:91)
>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:493)
>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(PDFStreamEngine.java:238)
>         at org.apache.pdfbox.contentstream.PDFStreamEngine.showTransparencyGroup(PDFStreamEngine.java:163)
>         at org.apache.pdfbox.contentstream.operator.DrawObject.process(DrawObject.java:60)
>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(PDFStreamEngine.java:238)
>         at org.apache.pdfbox.contentstream.PDFStreamEngine.showTransparencyGroup(PDFStreamEngine.java:163)
>         at org.apache.pdfbox.contentstream.operator.DrawObject.process(DrawObject.java:60)
> …
>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(PDFStreamEngine.java:238)
>         at org.apache.pdfbox.contentstream.PDFStreamEngine.showTransparencyGroup(PDFStreamEngine.java:163)
>         at org.apache.pdfbox.contentstream.operator.DrawObject.process(DrawObject.java:60)
>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(PDFStreamEngine.java:238)
>         at org.apache.pdfbox.contentstream.PDFStreamEngine.showTransparencyGroup(PDFStreamEngine.java:163)
>         at org.apache.pdfbox.contentstream.operator.DrawObject.process(DrawObject.java:60)
>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(PDFStreamEngine.java:238)
>         at org.apache.pdfbox.contentstream.PDFStreamEngine.showTransparencyGroup(PDFStreamEngine.java:163)
>         at org.apache.pdfbox.contentstream.operator.DrawObject.process(DrawObject.java:60)
>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(PDFStreamEngine.java:238)
>         at org.apache.pdfbox.contentstream.PDFStreamEngine.showTransparencyGroup(PDFStreamEngine.java:163)
>         at org.apache.pdfbox.contentstream.operator.DrawObject.process(DrawObject.java:60)
>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(PDFStreamEngine.java:238)
>         at org.apache.pdfbox.contentstream.PDFStreamEngine.showTransparencyGroup(PDFStreamEngine.java:163)
>         at org.apache.pdfbox.contentstream.operator.DrawObject.process(DrawObject.java:60)
>
>
>
> If I open the file with Edge, it displays correctly.
>
> Any ideas?
>
> Thanks,
> Maxence
>
>
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Monday, May 28, 2018 18:47
> *To:* user@manifoldcf.apache.org
> *Subject:* Re:
> org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:838)
> error SPAM 10Go/hour
>
>
>
> This sounds potentially like a problem in Tika, but in order to be sure I
> would need a complete stack trace, not just a piece of one.
>
> If it is a Tika issue, it should appear reliably on the same document,
> again and again.
>
>
>
> Is there any way you can crawl ONLY one of the documents that got
> blocked?  I suspect that when you paused and restarted, you just postponed
> the problem and it will happen again.
>
>
>
> Karl
>
>
>
>
>
> On Mon, May 28, 2018 at 9:50 AM msaunier <msaunier@citya.com> wrote:
>
> Hello Karl,
>
>
>
> In ManifoldCF 2.9, for all jobs, several documents (around twenty) remain
> blocked at the end of the job.
>
> A single error appears and floods the logs with several gigabytes in a
> short time, which filled up the servers:
>
> [?:?]
>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:838) ~[?:?]
>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:495) ~[?:?]
>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(PDFStreamEngine.java:231) ~[?:?]
>
> If I pause the job and start it again, the documents are sent and it works.
> But if I am not there, we have problems.
>
> Do you know this problem, and do you have a solution? Is it a bad
> configuration?
>
> Thank you.
>
>
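
Because the same frames repeat until the stack is exhausted, a complete trace of this error is very large, which fits the several gigabytes of log output described above. Here is a minimal sketch, in plain Java, of collapsing consecutive repeats so that a full trace can be shared in readable form. The class name, the `collapse` method and its `maxPeriod` parameter are hypothetical; none of this is part of ManifoldCF, Tika or PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: collapses blocks of stack frames that repeat back to back.
final class TraceCollapseSketch {

    // Scans the frames from top to bottom. At each position it looks for the
    // shortest block (1..maxPeriod lines) that is immediately repeated, keeps
    // one copy of that block and adds a line saying how often it occurred.
    static List<String> collapse(List<String> frames, int maxPeriod) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < frames.size()) {
            int period = 0;
            int repeats = 1;
            for (int p = 1; p <= maxPeriod && i + 2 * p <= frames.size(); p++) {
                int r = 1;
                while (i + (r + 1) * p <= frames.size()
                        && frames.subList(i, i + p).equals(frames.subList(i + r * p, i + (r + 1) * p))) {
                    r++;
                }
                if (r > 1) {
                    period = p;
                    repeats = r;
                    break;
                }
            }
            if (period == 0) {
                out.add(frames.get(i)); // no repetition starts here
                i++;
            } else {
                out.addAll(frames.subList(i, i + period));
                out.add("... previous " + period + " frame(s) occurred " + repeats + " times in a row");
                i += period * repeats;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Shortened frame names, for illustration only.
        List<String> frames = List.of(
                "at Inflater.<init>",
                "at DrawObject.process", "at processOperator",
                "at DrawObject.process", "at processOperator",
                "at DrawObject.process", "at processOperator",
                "at main");
        collapse(frames, 10).forEach(System.out::println);
    }
}
```

The sketch only merges blocks that repeat exactly, so frames with differing line numbers (such as 493 and 503 in the trace above) stay separate.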
