lucene-solr-user mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject RE: DataImportHandler - Unable to load Tika Config Processing Document # 1
Date Wed, 08 Feb 2017 21:14:41 GMT
> Thank you, I will follow Erick's steps.
> BTW, I am also trying to ingest using Flume. Flume uses Morphlines along with Tika.
> Will even the Flume SolrSink have the same issue?

Yes, when using Tika you run the risk of it choking on a document, eating CPU and/or RAM until
everything dies. This is also true when you run it standalone. The problem is usually caused
by PDF and Office documents that are unusual, corrupt or incomplete (e.g. truncated in size)
or extremely large. But even ordinary HTML can get you into trouble due to extreme sizes or
very deeply nested elements.

But, in general, it is not a problem you will experience frequently. We operate broad,
large-scale web crawlers, ingesting all kinds of bad stuff all the time. The trick to avoiding
problems is to run each Tika parse in a separate thread with a timer, and to kill the thread
if it exceeds the time limit. It can still go wrong, but trouble is very rare.
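
A minimal sketch of that pattern in plain Java, not our actual crawler code. The class
name TimedParse and the method parseWithTimeout are hypothetical, and the Callable stands
in for whatever Tika call you make. Note that Java can only interrupt a thread, not
forcibly kill it, so a parser that ignores the interrupt keeps running in the background;
the sketch uses a daemon thread so that such a thread at least cannot keep the JVM alive.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedParse {

    // Runs one parse task on its own daemon thread and gives up after timeoutMillis.
    // Returns the extracted text, or Optional.empty() if the task timed out or threw.
    // parseTask is a placeholder (assumption) for the real Tika parse call.
    public static Optional<String> parseWithTimeout(Callable<String> parseTask,
                                                    long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parse-worker");
            t.setDaemon(true); // a stuck parser must not block JVM shutdown
            return t;
        });
        Future<String> future = executor.submit(parseTask);
        try {
            return Optional.ofNullable(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            future.cancel(true); // sends an interrupt; this is cooperative, not a hard kill
            return Optional.empty();
        } catch (ExecutionException e) {
            return Optional.empty(); // the parser itself threw an exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return Optional.empty();
        } finally {
            executor.shutdownNow(); // one executor per document, always released
        }
    }

    public static void main(String[] args) {
        // A well-behaved document and a simulated runaway parse.
        Optional<String> ok = parseWithTimeout(() -> "extracted text", 1000);
        Optional<String> stuck = parseWithTimeout(() -> {
            Thread.sleep(60_000);
            return "never returned";
        }, 200);
        System.out.println("ok present: " + ok.isPresent());
        System.out.println("stuck present: " + stuck.isPresent());
    }
}
```

Creating one executor per document is wasteful but keeps a bad document from poisoning
a shared worker; a parse stuck in a tight CPU loop may still survive the interrupt, which
is one of the ways it can still go wrong.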

Running it standalone and talking to it over the network is safest, but that is not very
portable or easy to distribute on Hadoop or other platforms.
