lucene-solr-user mailing list archives

From xavi jmlucjav <jmluc...@gmail.com>
Subject Re: How is Tika used with Solr
Date Thu, 11 Feb 2016 23:08:07 GMT
For sure, if I need heavy-duty text extraction again, Tika would be the
obvious choice if it covers dealing with hangs. I never used tika-server
myself (not sure if it existed at the time); I just used Tika from my own JVM.

On Thu, Feb 11, 2016 at 8:45 PM, Allison, Timothy B. <tallison@mitre.org>
wrote:

> x-post to the Tika users list
>
> Y and n.  If you run tika app as:
>
> java -jar tika-app.jar <input_dir> <output_dir>
>
> It runs tika-batch under the hood (TIKA-1330 as part of TIKA-1302).  This
> creates a parent and a child process.  If the child process notices a hung
> thread, it dies, and the parent restarts it.  The parent also restarts the
> child if your OS gets upset with the child process and kills it out of
> self-preservation, or if there's an OOM.  You can also configure how often
> the child shuts itself down (with parental restarting) to mitigate memory
> leaks.
>
> So, y, if your use case allows <input_dir> <output_dir>, then we now have
> that in Tika.
>
> I've been wanting to add a similar watchdog to tika-server ... any
> interest in that?
>
>
> -----Original Message-----
> From: xavi jmlucjav [mailto:jmlucjav@gmail.com]
> Sent: Thursday, February 11, 2016 2:16 PM
> To: solr-user <solr-user@lucene.apache.org>
> Subject: Re: How is Tika used with Solr
>
> I have found that when you deal with large amounts of all sorts of files,
> in the end you find stuff (PDFs are typically nasty) that will hang Tika.
> That is even worse than a crash or OOM.
> We used Aperture instead of Tika because at the time it provided a
> watchdog feature to kill what seemed like a hung extraction thread. That
> feature is super important for a robust text extraction pipeline. Has Tika
> gained such a feature already?
>
> xavier
>
> On Wed, Feb 10, 2016 at 6:37 PM, Erick Erickson <erickerickson@gmail.com>
> wrote:
>
> > Timothy's points are absolutely spot-on. In production scenarios, if
> > you use the simple "run Tika in a SolrJ program" approach you _must_
> > abort the program on OOM errors and the like and  figure out what's
> > going on with the offending document(s). Or record the name somewhere
> > and skip it next time 'round. Or........
> >
> > How much you have to build in here really depends on your use case.
> > For "small enough"
> > sets of documents or one-time indexing, you can get by with dealing
> > with errors one at a time.
> > For robust systems where you have to have indexing available at all
> > times and _especially_ where you don't control the document corpus,
> > you have to build something far more tolerant as per Tim's comments.
> >
> > FWIW,
> > Erick
> >
> > On Wed, Feb 10, 2016 at 4:27 AM, Allison, Timothy B.
> > <tallison@mitre.org>
> > wrote:
> > > I completely agree on the impulse, and for the vast majority of the
> > > time (regular catchable exceptions), that'll work.  And, by vast
> > > majority, aside from oom on very large files, we aren't seeing these
> > > problems any more in our 3 million doc corpus (y, I know, small by
> > > today's standards) from govdocs1 and Common Crawl over on our
> > > Rackspace vm.
> > >
> > > Given my focus on Tika, I'm overly sensitive to the worst case
> > > scenarios.  I find it encouraging, Erick, that you haven't seen these
> > > types of problems, that users aren't complaining too often about
> > > catastrophic failures of Tika within Solr Cell, and that this thread
> > > is not yet swamped with integrators agreeing with me. :)
> > >
> > > However, because oom can leave memory in a corrupted state (right?),
> > > because you can't actually kill a thread for a permanent hang and
> > > because Tika is a kitchen sink and we can't prevent memory leaks in
> > > our dependencies, one needs to be aware that bad things can
> > > happen...if only very, very rarely.  For a fellow traveler who has run
> > > into these issues on massive data sets, see also [0].
> > >
> > > Configuring Hadoop to work around these types of problems is not too
> > > difficult -- it has to be done with some thought, though.  On
> > > conventional single box setups, the ForkParser within Tika is one
> > > option, tika-batch is another.  Hand rolling your own parent/child
> > > process is non-trivial and is not necessary for the vast majority of
> > > use cases.
> > >
> > >
> > > [0]
> > > http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: Erick Erickson [mailto:erickerickson@gmail.com]
> > > Sent: Tuesday, February 09, 2016 10:05 PM
> > > To: solr-user <solr-user@lucene.apache.org>
> > > Subject: Re: How is Tika used with Solr
> > >
> > > My impulse would be to _not_ run Tika in its own JVM, just catch any
> > > exceptions in my code and "do the right thing". I'm not sure I see any
> > > real benefit in yet another JVM.
> > >
> > > FWIW,
> > > Erick
> > >
> > > On Tue, Feb 9, 2016 at 6:22 PM, Allison, Timothy B.
> > > <tallison@mitre.org> wrote:
> > >> I have one answer here [0], but I'd be interested to hear what Solr
> > >> users/devs/integrators have experienced on this topic.
> > >>
> > >> [0]
> > >> http://mail-archives.apache.org/mod_mbox/tika-user/201602.mbox/%3CCY1PR09MB0795EAED947B53965BC86874C7D70%40CY1PR09MB0795.namprd09.prod.outlook.com%3E
> > >>
> > >> -----Original Message-----
> > >> From: Steven White [mailto:swhite4141@gmail.com]
> > >> Sent: Tuesday, February 09, 2016 6:33 PM
> > >> To: solr-user@lucene.apache.org
> > >> Subject: Re: How is Tika used with Solr
> > >>
> > >> Thank you Erick and Alex.
> > >>
> > >> My main question is about a long-running process using Tika in the
> > >> same JVM as my application.  I'm running my file-system-crawler in its
> > >> own JVM (not Solr's).  On the Tika mailing list, it is suggested to run
> > >> Tika's code in its own JVM and invoke it from my file-system-crawler
> > >> using Runtime.getRuntime().exec().
> > >>
> > >> I fully understand from Alex's suggestion and the link provided by
> > >> Erick that I should use Tika outside Solr.  But what about using Tika
> > >> within the same JVM as my file-system-crawler application?  Or should
> > >> I be making a system call to invoke another JAR that runs in its own
> > >> JVM to extract the raw text?  Are there known issues with Tika when
> > >> used in a long-running process?
> > >>
> > >> Steve
> > >>
> > >>
> >
>
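
The separate-JVM approach discussed in this thread (start the extractor as a
child process, and kill it from the parent if it hangs) can be sketched roughly
as below. This is a minimal illustration only, not how tika-batch or ForkParser
is implemented. The class name ChildJvmRunner, its Result type, and the
60-second timeout are hypothetical choices made for the example, and
"java -version" stands in for whatever extraction command would really be run.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper for illustration; not part of Tika or Solr. */
final class ChildJvmRunner {

    /** Outcome of one child run. */
    static final class Result {
        final boolean completed; // false if the parent had to kill the child
        final int exitCode;      // only meaningful when completed is true

        Result(boolean completed, int exitCode) {
            this.completed = completed;
            this.exitCode = exitCode;
        }
    }

    /** Runs the command in a child process and kills it if it exceeds the timeout. */
    static Result runWithTimeout(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        // Send the child's output to a scratch file so that a full pipe
        // buffer can never block the child while the parent is waiting.
        File log = File.createTempFile("child-jvm", ".log");
        log.deleteOnExit();
        pb.redirectErrorStream(true);
        pb.redirectOutput(log);

        Process child = pb.start();
        if (child.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            return new Result(true, child.exitValue());
        }
        // The child is presumed hung. Unlike a thread inside the same JVM,
        // a separate process can be killed from the outside.
        child.destroyForcibly();
        child.waitFor();
        return new Result(false, -1);
    }

    public static void main(String[] args) throws Exception {
        // With no arguments, run "java -version" as a harmless stand-in command.
        String javaBin = Paths.get(System.getProperty("java.home"), "bin", "java").toString();
        List<String> command = args.length > 0
                ? Arrays.asList(args)
                : Arrays.asList(javaBin, "-version");
        Result r = runWithTimeout(command, 60);
        System.out.println(r.completed
                ? "child exited with code " + r.exitCode
                : "child hung and was killed");
    }
}
```

A crawler using something like this could record the offending file name when
completed is false and skip it on the next pass, as suggested earlier in the
thread. If the <input_dir> <output_dir> mode of tika-app fits the use case,
the watchdog described above already covers this and no hand-rolled parent
process should be needed.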
