lucene-solr-user mailing list archives

From Charlie Hull <char...@flax.co.uk>
Subject Re: How is Tika used with Solr
Date Wed, 10 Feb 2016 08:54:31 GMT
On 09/02/2016 22:49, Alexandre Rafalovitch wrote:
> Solr uses Tika directly. And not in the most efficient way. It is
> there mostly for convenience rather than performance.
>
> So, for performance, the Solr recommendation is also to run Tika
> separately and only send Solr the processed documents.

Absolutely. It's entirely possible for a bad PDF or similar to kill Tika,
and since Tika runs inside Solr's JVM, that brings down your Solr instance
too.

Here's something a colleague wrote to wrap Tika in a server; maybe you
can use it:
https://github.com/mattflax/dropwizard-tika-server
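As a rough illustration of the "run Tika separately" approach, here is a
minimal Python sketch of a client that asks a standalone Tika server for
plain text and then posts the result to Solr as JSON. It is not taken from
the project linked above. The URLs, the "files" collection name, and the
"content_txt" field are all assumptions, so adjust them to your own setup.
The Tika call assumes the stock Apache tika-server, which accepts a PUT to
/tika and returns text when asked for text/plain.

```python
import http.client
import json
import urllib.request

# Assumed endpoints: a stock Apache tika-server and a Solr collection
# named "files". Both are placeholders for your own deployment.
TIKA_URL = "http://localhost:9998/tika"
SOLR_UPDATE_URL = "http://localhost:8983/solr/files/update?commit=true"


def tika_request(data, tika_url=TIKA_URL):
    """Build a PUT request asking the Tika server for plain text."""
    return urllib.request.Request(
        tika_url, data=data, method="PUT", headers={"Accept": "text/plain"}
    )


def extract_text(path, tika_url=TIKA_URL, timeout=60):
    """Return the extracted text, or None if Tika fails or times out."""
    with open(path, "rb") as f:
        data = f.read()
    try:
        req = tika_request(data, tika_url)
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (OSError, http.client.HTTPException):
        # A crashed or hung Tika process only costs us this one file.
        return None


def solr_request(path, text, solr_url=SOLR_UPDATE_URL):
    """Build a JSON update request; "content_txt" is an assumed field."""
    body = json.dumps([{"id": path, "content_txt": text}]).encode("utf-8")
    return urllib.request.Request(
        solr_url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def index_file(path):
    """Extract with Tika, then index in Solr; skip files Tika rejects."""
    text = extract_text(path)
    if text is None:
        return False
    with urllib.request.urlopen(solr_request(path, text), timeout=30) as resp:
        return resp.status == 200
```

The point of the split is that a document that crashes or hangs the
extractor shows up as a failed HTTP call in the crawler, and Solr never
sees it.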

Cheers

Charlie
>
> Regards,
>      Alex.
> ----
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
>
> On 10 February 2016 at 09:46, Steven White <swhite4141@gmail.com> wrote:
>> Hi folks,
>>
>> I'm writing a file-system crawler that will index files.  The file system
>> is going to be very busy and I anticipate on average 10 new updates per
>> minute.  My application checks for new or updated files once a minute.  I
>> use Tika to extract the raw text from those files and send it over to Solr
>> for indexing.  My application will be running 24x7xN-days.  It will not
>> recycle unless the OS is restarted.
>>
>> Over at Tika mailing list, I was told the following:
>>
>> "As a side note, if you are handling a bunch of files from the wild in a
>> production environment, I encourage separating Tika into a separate jvm vs
>> tying it into any post processing – consider tika-batch and writing
>> separate text files for each file processed (not so efficient, but
>> exceedingly robust).  If this is demo code or you know your document set
>> well enough, you should be good to go with keeping Tika and your
>> postprocessing steps in the same jvm."
>>
>> My question is, how does Solr utilize Tika?  Does it run Tika in its own
>> JVM as an out-of-process application or does it link with Tika JARs
>> directly?  If it links in directly, are there known issues with Solr
>> integrated with Tika because of Tika issues?
>>
>> Thanks
>>
>> Steve


-- 
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
