lucene-solr-user mailing list archives

From Erick Erickson <>
Subject Re: pdfs
Date Mon, 26 May 2014 16:20:24 GMT

Yeah, if you can share the PDF that would be great. Parsing via Tika should
not bring down Solr, although I suppose there could be something in Tika
that is pathologically bad.

You could also try using Tika itself in SolrJ and indexing from a client. That
might let you
1> more gracefully handle this without shutting down Solr
2> use different versions of Tika.

Personally I like offloading the document parsing to clients anyway since it
lessens the load on the Solr server and scales much better, but YMMV.

It's not actually very difficult; here's a skeleton (rip out the DB parts):
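
The skeleton itself is not reproduced in this message. As a separate, minimal sketch (not the skeleton referred to above), the client-side idea can be illustrated in plain Java: run each extraction on a worker thread with a time limit, so that one bad PDF cannot stall the whole feed. GuardedExtractor and extract are hypothetical names, and the Callable is a stand-in for a real parser call such as Tika. Sending the extracted text to Solr via SolrJ is left out.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: runs one document extraction with a time limit. */
final class GuardedExtractor {

    /**
     * Runs the extraction on a daemon worker thread and waits at most
     * timeoutMillis. Returns the extracted text, or empty if the extraction
     * timed out, threw an exception, or returned null.
     */
    static Optional<String> extract(Callable<String> extraction, long timeoutMillis) {
        ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "extract-worker");
            t.setDaemon(true); // a stuck parser must not keep the JVM alive
            return t;
        });
        Future<String> result = worker.submit(extraction);
        try {
            return Optional.ofNullable(result.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            result.cancel(true); // interrupts the worker thread
            return Optional.empty();
        } catch (ExecutionException e) {
            return Optional.empty(); // the parser threw: skip this document
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            result.cancel(true);
            return Optional.empty();
        } finally {
            worker.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-in for a parser call that returns quickly.
        Optional<String> ok = extract(() -> "hello", 1000);
        // Stand-in for a parser that hangs; Thread.sleep responds to the interrupt.
        Optional<String> stuck = extract(() -> { Thread.sleep(60_000); return "never"; }, 200);
        System.out.println(ok.isPresent() + " " + stuck.isPresent()); // prints: true false
    }
}
```

Note that cancelling only interrupts the worker thread. A parser spinning in a tight loop may ignore the interrupt and keep burning CPU, which is where running the extraction in an external process that can be killed (as suggested in the quoted messages below) becomes the more robust option.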


On Sun, May 25, 2014 at 2:07 AM, Siegfried Goeschl <> wrote:
> Sorry typo :- can you send me the PDF by email directly :-)
> Siegfried Goeschl
> On 25 May 2014, at 10:06, Siegfried Goeschl <> wrote:
>> Hi Brian,
>> can you send me the email? I would like to play around :-)
>> Have you opened a JIRA for PDFBox? If not, I will open one if I can reproduce the issue …
>> Thanks in advance
>> Siegfried Goeschl
>> On 25 May 2014, at 04:18, Brian McDowell <> wrote:
>>> Our feeding (indexing) tool halts because Solr becomes unresponsive after
>>> getting some really bad pdfs. There are levels of pdf "badness." Some just
>>> will not parse and that's fine, but others are more problematic in that our
>>> Operations team has to restart Solr because it just hangs and accepts no
>>> more documents. I actually have identified a pdf that will bring down Solr
>>> every time. Does anyone think that doing pre-validation using the pdfbox
>>> jar will work? Or, will trying to validate just hang as well? Any help is
>>> appreciated.
>>> On Thu, May 22, 2014 at 8:47 AM, Jack Krupansky <> wrote:
>>>> Yeah, I recall running into infinite loop issues with PDFBox in Solr years
>>>> ago. They keep fixing these issues, but they keep popping up again. Sigh.
>>>> -- Jack Krupansky
>>>> -----Original Message----- From: Siegfried Goeschl
>>>> Sent: Thursday, May 22, 2014 4:35 AM
>>>> To:
>>>> Subject: Re: pdfs
>>>> Hi folks,
>>>> for a small customer project I'm running SOLR with embedded Tika.
>>>> * memory consumption is an issue but can be handled
>>>> * there is an issue with PDFBox hitting an infinite loop which causes
>>>> excessive CPU usage - requires SOLR restart but happens only once
>>>> within 400,000 documents (PDF, Word, etc.), but it seems a little bit
>>>> erratic since I was never able to track the problem back to a particular
>>>> PDF document
>>>> Having said that we wire SOLR with Nagios to get an alarm when CPU
>>>> consumption goes through the roof
>>>> If you are doing really serious stuff I would recommend
>>>> * moving the document extraction stuff out of SOLR
>>>> * providing monitoring and recovery for stuck document extractions
>>>> ** killing worker threads
>>>> ** using external processes and killing them when spinning out of control
>>>> Cheers,
>>>> Siegfried Goeschl
>>>> On 22.05.14 06:46, Jack Krupansky wrote:
>>>>> Yeah, PDF extraction has always been at least somewhat problematic. It
>>>>> has improved over the years, but still not likely to be perfect.
>>>>> That said, I'm not aware of any specific PDF extraction issue that would
>>>>> bring down Solr - as opposed to causing a 500 status with an exception
>>>>> in PDF extraction, with the exception of memory usage. Some PDF
>>>>> documents, especially those which are graphic-intense can require a lot
>>>>> of memory. The rest of Solr could be adversely affected if all available
>>>>> JVM heap is consumed. The solution is to give the JVM more heap space.
>>>>> So, what is your specific symptom?
>>>>> -- Jack Krupansky
>>>>> -----Original Message----- From: Brian McDowell
>>>>> Sent: Thursday, May 22, 2014 12:24 AM
>>>>> To:
>>>>> Subject: pdfs
>>>>> Has anyone had issues with indexing pdf files? Some pdfs are bringing
>>>>> Solr down completely so that it actually needs to be manually restarted. We
>>>>> are using Solr 4.4 and thought that upgrading to Solr 4.8 would solve the
>>>>> problem because the release notes associated with the new tika version
>>>>> and the new pdfbox indicate fixes for pdf issues. It didn't work and
>>>>> this issue is causing us to reevaluate using Solr. Any help on this matter
>>>>> would be greatly appreciated. Thank you!
