lucene-solr-user mailing list archives

From Charlie Hull <char...@flax.co.uk>
Subject Re: Bypassing ExtractingRequestHandler
Date Fri, 10 Jun 2016 08:22:04 GMT
On 10/06/2016 02:20, Justin Lee wrote:
> Has anybody had any experience bypassing ExtractingRequestHandler and
> simply managing Tika manually?  I want to make a small modification to Tika
> to get and save additional data from my PDFs, but I have been
> procrastinating in no small part due to the unpleasant prospect of setting
> up a development environment where I could compile and debug modifications
> that might run through PDFBox, Tika, and ExtractingRequestHandler.  It
> occurs to me that it would be much easier if the two were separate, so I
> could have direct control over Tika and just submit the text to Solr after
> extraction.  Am I going to regret this approach?  I'm not sure what
> ExtractingRequestHandler really does for me that Tika doesn't already do.

We tend to prefer running Tika externally as it's entirely possible that 
Tika will crash or hang with certain files - and that will bring down 
Solr if you're running Tika within it. Here's a Dropwizard wrapper 
around Tika that might be of use:
https://github.com/mattflax/dropwizard-tika-server
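
The isolation idea above can be sketched as follows. This is a minimal illustration, not the code of the Dropwizard wrapper linked above: it runs an extractor as a separate process with a timeout, so a crash or hang costs one file rather than the Solr JVM, and then prepares a JSON update request for Solr. The jar path, the core name "mycore", and the field name "content_txt" are assumptions and need to match your own setup and schema.

```python
import json
import subprocess
import urllib.request


def extract_text(command, timeout_s=60):
    """Run an external extractor; return its stdout, or None if it crashed or hung."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # The extractor hung; subprocess.run has already killed the child.
        return None
    if result.returncode != 0:
        # The extractor crashed or rejected the file.
        return None
    return result.stdout


def build_update_request(solr_update_url, doc_id, text):
    """Build a POST request carrying one document as a JSON array."""
    # "content_txt" is an assumed field name; use whatever your schema defines.
    body = json.dumps([{"id": doc_id, "content_txt": text}]).encode("utf-8")
    return urllib.request.Request(
        solr_update_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def index_file(path, solr_update_url, extractor_prefix):
    """Extract one file out of process and send the text to Solr."""
    text = extract_text(extractor_prefix + [path])
    if text is None:
        return False  # skip the bad file; Solr never sees it
    request = build_update_request(solr_update_url, path, text)
    with urllib.request.urlopen(request) as response:
        return response.status == 200
```

A call might look like this, assuming a local tika-app jar (its --text option prints plain text) and a Solr core named "mycore":

```python
index_file(
    "report.pdf",
    "http://localhost:8983/solr/mycore/update?commit=true",
    ["java", "-jar", "tika-app.jar", "--text"],
)
```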

Cheers

Charlie
>
> Also, I was reading this
> <http://stackoverflow.com/questions/33292776/solr-tika-processor-not-crawling-my-pdf-files-prefectly>
> stackoverflow entry and someone offhandedly mentioned that
> ExtractingRequestHandler might be separated in the future anyway. Is there
> a public roadmap for the project, or does one have to keep up with the
> developer's mailing list and hunt through JIRA entries to keep up with the
> pulse of the project?
>
> Thanks,
> Justin
>


-- 
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
