lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: pdfs
Date Mon, 26 May 2014 16:20:24 GMT
Brian:

Yeah, if you can share the PDF that would be great. Parsing via Tika should
not bring down Solr, although I suppose there could be something in Tika
that is pathologically bad.

You could also try using Tika itself in SolrJ and indexing from a client. That
might let you
1> more gracefully handle this without shutting down Solr
2> use different versions of Tika.

Personally I like offloading the document parsing to clients anyway since it
lessens the load on the Solr server and scales much better, but YMMV.

It's not actually very difficult; here's a skeleton (rip out the DB parts):
http://searchhub.org/2012/02/14/indexing-with-solrj/
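
To make the "more gracefully handle this" point concrete, here is a minimal
sketch (not taken from the linked article) of guarding a client-side parse
with a timeout, so that one bad PDF is skipped instead of stalling the whole
feed. The names GuardedExtraction and extractWithTimeout and the timeout
values are made up for illustration, and the Callable is only a stand-in for
whatever Tika call the client actually makes. One caveat: cancelling merely
interrupts the worker thread, so a parser stuck in a loop that never checks
for interruption keeps burning CPU in the client until that JVM exits.
Running the parse in a separate process that can be killed is the more
robust variant.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Tika or SolrJ.
class GuardedExtraction {

    /**
     * Runs the extractor on a worker thread.
     * Returns the extracted text, or null if the extractor throws
     * or does not finish within timeoutMillis.
     */
    static String extractWithTimeout(Callable<String> extractor, long timeoutMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "extract-worker");
            t.setDaemon(true); // a stuck parse must not keep the client JVM alive
            return t;
        });
        Future<String> result = pool.submit(extractor);
        try {
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupts the worker; code that ignores interrupts keeps running.
            result.cancel(true);
            return null;
        } catch (ExecutionException e) {
            // The parser threw: treat the document as unparseable and move on.
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return null;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-in for a parse that succeeds quickly.
        String ok = extractWithTimeout(() -> "extracted text", 1_000);
        System.out.println(ok);

        // Stand-in for a parse that hangs; the caller gets null after 200 ms.
        String stuck = extractWithTimeout(() -> {
            Thread.sleep(60_000);
            return "never returned";
        }, 200);
        System.out.println(stuck);
    }
}
```

A client would only build and send a Solr document when the result is
non-null, and would log the offending file otherwise so it can be inspected
later.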

Best,
Erick

On Sun, May 25, 2014 at 2:07 AM, Siegfried Goeschl <sgoeschl@gmx.at> wrote:
> Sorry typo :- can you send me the PDF by email directly :-)
>
> Siegfried Goeschl
>
> On 25 May 2014, at 10:06, Siegfried Goeschl <sgoeschl@gmx.at> wrote:
>
>> Hi Brian,
>>
>> can you send me the email? I would like to play around :-)
>>
>> Have you opened a JIRA for PDFBox? If not I will open one if I can reproduce the
>> issue …
>>
>> Thanks in advance
>>
>> Siegfried Goeschl
>>
>>
>> On 25 May 2014, at 04:18, Brian McDowell <brianmcd08@gmail.com> wrote:
>>
>>> Our feeding (indexing) tool halts because Solr becomes unresponsive after
>>> getting some really bad pdfs. There are levels of pdf "badness." Some just
>>> will not parse and that's fine, but others are more problematic in that our
>>> Operations team has to restart Solr because it just hangs and accepts no
>>> more documents. I actually have identified a pdf that will bring down Solr
>>> every time. Does anyone think that doing pre-validation using the pdfbox
>>> jar will work? Or, will trying to validate just hang as well? Any help is
>>> appreciated.
>>>
>>>
>>> On Thu, May 22, 2014 at 8:47 AM, Jack Krupansky <jack@basetechnology.com> wrote:
>>>
>>>> Yeah, I recall running into infinite loop issues with PDFBox in Solr years
>>>> ago. They keep fixing these issues, but they keep popping up again. Sigh.
>>>>
>>>> -- Jack Krupansky
>>>>
>>>> -----Original Message----- From: Siegfried Goeschl
>>>> Sent: Thursday, May 22, 2014 4:35 AM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Re: pdfs
>>>>
>>>>
>>>> Hi folks,
>>>>
>>>> for a small customer project I'm running SOLR with embedded Tika.
>>>>
>>>> * memory consumption is an issue but can be handled
>>>> * there is an issue with PDFBox hitting an infinite loop which causes
>>>> excessive CPU usage - requires SOLR restart but happens only once
>>>> within 400,000 documents (PDF, Word, etc.) but it seems a little bit
>>>> erratic since I was never able to track the problem back to a particular
>>>> PDF document
>>>>
>>>> Having said that we wire SOLR with Nagios to get an alarm when CPU
>>>> consumption goes through the roof
>>>>
>>>> If you are doing really serious stuff I would recommend
>>>> * moving the document extraction stuff out of SOLR
>>>> * providing monitoring and recovery for stuck document extractions
>>>> ** killing worker threads
>>>> ** using external processes and killing them when they spin out of control
>>>>
>>>> Cheers,
>>>>
>>>> Siegfried Goeschl
>>>>
>>>> On 22.05.14 06:46, Jack Krupansky wrote:
>>>>
>>>>> Yeah, PDF extraction has always been at least somewhat problematic. It
>>>>> has improved over the years, but still not likely to be perfect.
>>>>>
>>>>> That said, I'm not aware of any specific PDF extraction issue that would
>>>>> bring down Solr - as opposed to causing a 500 status with an exception
>>>>> in PDF extraction, with the exception of memory usage. Some PDF
>>>>> documents, especially those which are graphic-intensive, can require a lot
>>>>> of memory. The rest of Solr could be adversely affected if all available
>>>>> JVM heap is consumed. The solution is to give the JVM more heap space.
>>>>>
>>>>> So, what is your specific symptom?
>>>>>
>>>>> -- Jack Krupansky
>>>>>
>>>>> -----Original Message----- From: Brian McDowell
>>>>> Sent: Thursday, May 22, 2014 12:24 AM
>>>>> To: solr-user@lucene.apache.org
>>>>> Subject: pdfs
>>>>>
>>>>> Has anyone had issues with indexing pdf files? Some pdfs are bringing down
>>>>> Solr completely so that it actually needs to be manually restarted. We are
>>>>> using Solr 4.4 and thought that upgrading to Solr 4.8 would solve the
>>>>> problem because the release notes associated with the new tika version and
>>>>> also the new pdfbox indicate fixes for pdf issues. It didn't work and now
>>>>> this issue is causing us to reevaluate using Solr. Any help on this matter
>>>>> would be greatly appreciated. Thank you!
>>>>>
>>>>
>>>>
>>
>
