lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: How is Tika used with Solr
Date Wed, 10 Feb 2016 17:37:55 GMT
Timothy's points are absolutely spot-on. In production scenarios, if
you use the simple "run Tika in a SolrJ program" approach you _must_
abort the program on OOM errors and the like and figure out what's
going on with the offending document(s). Or record the name somewhere
and skip it next time 'round. Or........
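
A minimal sketch of the "record the name and skip it" idea, under
stated assumptions: Extractor and Indexer are hypothetical stand-ins
for the Tika parse call and the SolrJ add call, not real Tika or SolrJ
types. Ordinary exceptions are caught and the file is recorded;
OutOfMemoryError is an Error, so it is deliberately not caught and
still aborts the program.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SkipBadDocs {
    /** Hypothetical stand-in for the Tika call that turns a file into text. */
    interface Extractor {
        String extract(Path file) throws Exception;
    }

    /** Hypothetical stand-in for sending the extracted text to Solr via SolrJ. */
    interface Indexer {
        void index(Path file, String text) throws Exception;
    }

    /**
     * Extracts and indexes each file in turn. Regular exceptions are logged
     * and the file is recorded as failed; Errors (e.g. OutOfMemoryError) are
     * not caught, so they propagate and stop the run.
     */
    static List<Path> indexAll(List<Path> files, Extractor extractor, Indexer indexer) {
        List<Path> failed = new ArrayList<>();
        for (Path file : files) {
            try {
                indexer.index(file, extractor.extract(file));
            } catch (Exception e) {
                failed.add(file); // remember it so the next run can skip or inspect it
                System.err.println("Skipping " + file + ": " + e);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        List<Path> files = Arrays.asList(Paths.get("good.pdf"), Paths.get("bad.doc"));
        List<Path> failed = indexAll(
                files,
                f -> {
                    // Simulated extractor: fails on any file whose name starts with "bad"
                    if (f.toString().startsWith("bad")) {
                        throw new IllegalStateException("simulated parse failure");
                    }
                    return "extracted text";
                },
                (f, text) -> { /* simulated indexer: does nothing */ });
        System.out.println("Failed files: " + failed);
    }
}
```

In a real crawler the failed list would be persisted (a file or a
database table) so that the offending documents can be skipped or
examined on the next run.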

How much you have to build in here really depends on your use case.
For "small enough" sets of documents or one-time indexing, you can get
by with dealing with errors one at a time. For robust systems where you
have to have indexing available at all times and _especially_ where you
don't control the document corpus, you have to build something far more
tolerant as per Tim's comments.

FWIW,
Erick

On Wed, Feb 10, 2016 at 4:27 AM, Allison, Timothy B. <tallison@mitre.org> wrote:
> I completely agree on the impulse, and for the vast majority of the
> time (regular catchable exceptions), that'll work.  And, by vast
> majority, aside from OOM on very large files, we aren't seeing these
> problems any more in our 3 million doc corpus (yes, I know, small by
> today's standards) from govdocs1 and Common Crawl over on our
> Rackspace VM.
>
> Given my focus on Tika, I'm overly sensitive to the worst case
> scenarios.  I find it encouraging, Erick, that you haven't seen these
> types of problems, that users aren't complaining too often about
> catastrophic failures of Tika within Solr Cell, and that this thread
> is not yet swamped with integrators agreeing with me. :)
>
> However, because OOM can leave memory in a corrupted state (right?),
> because you can't actually kill a thread for a permanent hang, and
> because Tika is a kitchen sink and we can't prevent memory leaks in
> our dependencies, one needs to be aware that bad things can
> happen...if only very, very rarely.  For a fellow traveler who has
> run into these issues on massive data sets, see also [0].
>
> Configuring Hadoop to work around these types of problems is not too
> difficult -- it has to be done with some thought, though.  On
> conventional single-box setups, the ForkParser within Tika is one
> option, tika-batch is another.  Hand-rolling your own parent/child
> process is non-trivial and is not necessary for the vast majority of
> use cases.
>
>
> [0] http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
>
>
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com]
> Sent: Tuesday, February 09, 2016 10:05 PM
> To: solr-user <solr-user@lucene.apache.org>
> Subject: Re: How is Tika used with Solr
>
> My impulse would be to _not_ run Tika in its own JVM, just catch any
> exceptions in my code and "do the right thing". I'm not sure I see
> any real benefit in yet another JVM.
>
> FWIW,
> Erick
>
> On Tue, Feb 9, 2016 at 6:22 PM, Allison, Timothy B. <tallison@mitre.org> wrote:
>> I have one answer here [0], but I'd be interested to hear what Solr
>> users/devs/integrators have experienced on this topic.
>>
>> [0] http://mail-archives.apache.org/mod_mbox/tika-user/201602.mbox/%3CCY1PR09MB0795EAED947B53965BC86874C7D70%40CY1PR09MB0795.namprd09.prod.outlook.com%3E
>>
>> -----Original Message-----
>> From: Steven White [mailto:swhite4141@gmail.com]
>> Sent: Tuesday, February 09, 2016 6:33 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: How is Tika used with Solr
>>
>> Thank you Erick and Alex.
>>
>> My main question is with a long running process using Tika in the
>> same JVM as my application.  I'm running my file-system-crawler in
>> its own JVM (not Solr's).  On the Tika mailing list, it is suggested
>> to run Tika's code in its own JVM and invoke it from my
>> file-system-crawler using Runtime.getRuntime().exec().
>>
>> I fully understand from Alex's suggestion and the link provided by
>> Erick to use Tika outside Solr.  But what about using Tika within
>> the same JVM as my file-system-crawler application, or should I be
>> making a system call to invoke another JAR that runs in its own JVM
>> to extract the raw text?  Are there known issues with Tika when used
>> in a long running process?
>>
>> Steve
>>
>>
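
For the "invoke another JAR that runs in its own JVM" option raised in
the thread, a rough sketch using only the JDK's ProcessBuilder (rather
than Runtime.getRuntime().exec() directly) might look like the
following. The class and method names are hypothetical, and the
command passed in is an assumption; this is not how ForkParser or
tika-batch are implemented. The point is that a hung or memory-starved
child can be killed from the parent, which is not possible for a
thread inside the same JVM.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

class ChildProcessRunner {
    /**
     * Runs the given command as a child process, sending its stdout and
     * stderr to outputFile. Waits at most timeoutSeconds; if the child has
     * not finished by then it is killed.
     *
     * @return the child's exit code, or -1 if it was killed on timeout
     */
    static int runWithTimeout(List<String> command, File outputFile, long timeoutSeconds)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true);   // merge stderr into stdout
        pb.redirectOutput(outputFile);  // write to a file so a full pipe cannot block the child
        Process child = pb.start();
        if (!child.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            child.destroyForcibly();    // permanent hang: kill the whole child JVM
            child.waitFor();            // wait until the kill has taken effect
            return -1;
        }
        return child.exitValue();
    }

    public static void main(String[] args) throws Exception {
        // Demo only: launches "java -version" with the JVM that runs this program.
        String javaBin = System.getProperty("java.home")
                + File.separator + "bin" + File.separator + "java";
        File out = File.createTempFile("child-output", ".txt");
        int exit = runWithTimeout(Arrays.asList(javaBin, "-version"), out, 60);
        System.out.println("exit code: " + exit + ", output in " + out);
    }
}
```

For text extraction the command would be something along the lines of
java -Xmx512m -jar tika-app.jar --text <file>, where the jar location
and the heap limit are assumptions to adapt. A non-zero or -1 result
can then be recorded against the file name, in the same spirit as
skipping bad documents on the next run.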
