lucene-solr-user mailing list archives

From Siegfried Goeschl <>
Subject Re: pdfs
Date Thu, 22 May 2014 08:35:03 GMT
Hi folks,

for a small customer project I'm running SOLR with embedded Tika.

* memory consumption is an issue but can be handled
* there is an issue with PDFBox hitting an infinite loop which causes
excessive CPU usage - it requires a SOLR restart but happens only about
once within 400.000 documents (PDF, Word, etc.). It seems a little bit
erratic since I was never able to track the problem back to a particular
PDF document

Having said that, we wire SOLR to Nagios to get an alarm when CPU
consumption goes through the roof

If you are doing really serious stuff I would recommend
* moving the document extraction stuff out of SOLR
* providing monitoring and recovery for stuck document extractions
** killing worker threads
** using external processes and killing them when they spin out of control
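
As a rough illustration of the "external process" option above, here is a minimal Java sketch of a watchdog that runs one extraction command in a child process and kills it when it exceeds a time limit. The class name ExtractionWatchdog, the Result enum and the timeout handling are assumptions made for this example, not something taken from SOLR or Tika. The output is simply discarded here; a real setup would redirect it to a file and index that.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: supervises one external extraction process. */
public class ExtractionWatchdog {

    /** Outcome of one supervised run. */
    public enum Result { OK, FAILED, TIMED_OUT }

    /**
     * Starts the command, waits up to the given timeout, and forcibly
     * kills the child process if it has not finished by then.
     */
    public static Result run(List<String> command, long timeout, TimeUnit unit)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true);
        // Discard output so a chatty child cannot block on a full pipe
        // (Redirect.DISCARD requires Java 9 or later).
        builder.redirectOutput(ProcessBuilder.Redirect.DISCARD);

        Process process = builder.start();
        if (!process.waitFor(timeout, unit)) {
            // The extraction is stuck (e.g. spinning in a loop): kill it.
            process.destroyForcibly();
            process.waitFor(); // reap the child before returning
            return Result.TIMED_OUT;
        }
        return process.exitValue() == 0 ? Result.OK : Result.FAILED;
    }
}
```

A caller would pass whatever command performs the extraction, for example something along the lines of "java -jar tika-app.jar --text some.pdf" (the jar location and file name are placeholders), and treat TIMED_OUT as "skip this document and log it". Unlike interrupting a worker thread, killing a separate process also works when the code is stuck in a loop that never checks for interruption.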


Siegfried Goeschl

On 22.05.14 06:46, Jack Krupansky wrote:
> Yeah, PDF extraction has always been at least somewhat problematic. It
> has improved over the years, but still not likely to be perfect.
> That said, I'm not aware of any specific PDF extraction issue that would
> bring down Solr - as opposed to causing a 500 status with an exception
> in PDF extraction, with the exception of memory usage. Some PDF
> documents, especially those which are graphic-intense can require a lot
> of memory. The rest of Solr could be adversely affected if all available
> JVM heap is consumed. The solution is to give the JVM more heap space.
> So, what is your specific symptom?
> -- Jack Krupansky
> -----Original Message----- From: Brian McDowell
> Sent: Thursday, May 22, 2014 12:24 AM
> To:
> Subject: pdfs
> Has anyone had issues with indexing pdf files? Some pdfs are bringing down
> Solr completely so that it actually needs to be manually restarted. We are
> using Solr 4.4 and thought that upgrading to Solr 4.8 would solve the
> problem because the release notes associated with the new tika version and
> also the new pdfbox indicate fixes for pdf issues. It didn't work and now
> this issue is causing us to reevaluate using Solr. Any help on this matter
> would be greatly appreciated. Thank you!
