lucene-solr-user mailing list archives

From Siegfried Goeschl <sgoes...@gmx.at>
Subject Re: pdfs
Date Sun, 25 May 2014 08:06:59 GMT
Hi Brian,

can you send me the email? I would like to play around :-)

Have you opened a JIRA for PDFBox? If not, I will open one if I can reproduce the issue …


Thanks in advance

Siegfried Goeschl


On 25 May 2014, at 04:18, Brian McDowell <brianmcd08@gmail.com> wrote:

> Our feeding (indexing) tool halts because Solr becomes unresponsive after
> getting some really bad pdfs. There are levels of pdf "badness." Some just
> will not parse and that's fine, but others are more problematic in that our
> Operations team has to restart Solr because it just hangs and accepts no
> more documents. I actually have identified a pdf that will bring down Solr
> every time. Does anyone think that doing pre-validation using the pdfbox
> jar will work? Or, will trying to validate just hang as well? Any help is
> appreciated.
> 
> 
> On Thu, May 22, 2014 at 8:47 AM, Jack Krupansky <jack@basetechnology.com> wrote:
> 
>> Yeah, I recall running into infinite loop issues with PDFBox in Solr years
>> ago. They keep fixing these issues, but they keep popping up again. Sigh.
>> 
>> -- Jack Krupansky
>> 
>> -----Original Message----- From: Siegfried Goeschl
>> Sent: Thursday, May 22, 2014 4:35 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: pdfs
>> 
>> 
>> Hi folks,
>> 
>> for a small customer project I'm running SOLR with embedded Tika.
>> 
>> * memory consumption is an issue but can be handled
>> * there is an issue with PDFBox hitting an infinite loop which causes
>> excessive CPU usage - it requires a SOLR restart but happens only once
>> within 400.000 documents (PDF, Word, etc.), and it seems a little bit
>> erratic since I was never able to track the problem back to a particular
>> PDF document
>> 
>> Having said that, we wire SOLR with Nagios to get an alarm when CPU
>> consumption goes through the roof
>> 
>> If you are doing really serious stuff I would recommend
>> * moving the document extraction stuff out of SOLR
>> * providing monitoring and recovery for stuck document extractions
>> ** killing worker threads
>> ** using external processes and killing them when they spin out of control
>> 
>> Cheers,
>> 
>> Siegfried Goeschl
>> 
>> On 22.05.14 06:46, Jack Krupansky wrote:
>> 
>>> Yeah, PDF extraction has always been at least somewhat problematic. It
>>> has improved over the years, but still not likely to be perfect.
>>> 
>>> That said, I'm not aware of any specific PDF extraction issue that would
>>> bring down Solr - as opposed to causing a 500 status with an exception
>>> in PDF extraction, with the exception of memory usage. Some PDF
>>> documents, especially those which are graphics-intensive, can require a lot
>>> of memory. The rest of Solr could be adversely affected if all available
>>> JVM heap is consumed. The solution is to give the JVM more heap space.
>>> 
>>> So, what is your specific symptom?
>>> 
>>> -- Jack Krupansky
>>> 
>>> -----Original Message----- From: Brian McDowell
>>> Sent: Thursday, May 22, 2014 12:24 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: pdfs
>>> 
>>> Has anyone had issues with indexing pdf files? Some pdfs are bringing down
>>> Solr completely so that it actually needs to be manually restarted. We are
>>> using Solr 4.4 and thought that upgrading to Solr 4.8 would solve the
>>> problem because the release notes associated with the new tika version and
>>> also the new pdfbox indicate fixes for pdf issues. It didn't work and now
>>> this issue is causing us to reevaluate using Solr. Any help on this matter
>>> would be greatly appreciated. Thank you!
>>> 
>> 
>> 
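
The recommendation in the thread to move extraction out of Solr, run it in external processes, and kill those processes when they spin out of control could be sketched roughly as below. This is a minimal illustration, not anything the posters actually ran: the class name ExtractionWatchdog, the Result enum, and the runWithTimeout helper are hypothetical, and it relies only on the standard java.lang.Process API (Java 8 or later). The extraction command itself is whatever you choose to launch.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: runs an extraction command in a child process with a hard timeout. */
public class ExtractionWatchdog {

    /** Outcome of one guarded run. */
    public enum Result { OK, FAILED, TIMED_OUT }

    /**
     * Starts the command, sends its stdout and stderr to outputFile, and waits
     * at most the given timeout. A child that is still running after the
     * timeout (for example, stuck in an infinite loop) is killed.
     */
    public static Result runWithTimeout(List<String> command, File outputFile,
                                        long timeout, TimeUnit unit)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        // Write to a file so the child never blocks on a full pipe buffer.
        pb.redirectErrorStream(true);
        pb.redirectOutput(outputFile);

        Process process = pb.start();
        boolean finished = process.waitFor(timeout, unit);
        if (!finished) {
            // Kills the direct child only; any grandchildren are not handled here.
            process.destroyForcibly();
            process.waitFor(); // wait until the child is really gone
            return Result.TIMED_OUT;
        }
        return process.exitValue() == 0 ? Result.OK : Result.FAILED;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: ExtractionWatchdog <command> [args...]");
            return;
        }
        File out = File.createTempFile("extracted", ".txt");
        // 60 seconds is an arbitrary illustrative limit.
        Result result = runWithTimeout(Arrays.asList(args), out, 60, TimeUnit.SECONDS);
        System.out.println(result + " -> " + out);
    }
}
```

The feeding tool would then post only the extracted text to Solr and log or quarantine any document that comes back as TIMED_OUT or FAILED, so a bad PDF costs one child process rather than the Solr JVM. One possible invocation is "java ExtractionWatchdog java -jar tika-app.jar --text bad.pdf"; the jar path and the --text option are assumptions here and should be checked against the Tika version in use.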

