manifoldcf-user mailing list archives

From msaunier <>
Subject RE: org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator( error SPAM 10Go/hour
Date Wed, 30 May 2018 09:34:06 GMT
Hello Karl,

I have checked the files, and our provider made a mistake when generating the PDFs for this server.
We have a null joined scan parameter.

We have similar errors on other servers, with no error log. I will look into those as well.


So this PDF error is fine; it is just an error in the files.

For the other servers, I will check and get back to you.






From: Karl Wright []
Sent: Monday, 28 May 2018 18:47
To:
Subject: Re: org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator( error SPAM 10Go/hour


This sounds potentially like a problem in Tika, but in order to be sure I would need a complete
stack trace, not just a piece of one.

If it is a Tika issue, it should appear reliably on the same document, again and again.


Is there any way you can crawl ONLY one of the documents that got blocked?  I suspect that
when you paused and restarted, you just postponed the problem and it will happen again.





On Mon, May 28, 2018 at 9:50 AM msaunier <> wrote:

Hello Karl,


In ManifoldCF 2.9, for every job, several documents (around twenty) remain at the end of the job.

A single error appears and spams the logs with several gigabytes in a short time, which filled up the servers:



               at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(

               at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(

               at org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(


If I pause the job and restart it, the documents are sent and it works. But if I am not there,
we have problems.


Do you know this problem, and do you have a solution? Is it a bad configuration?


Thank you.
