lucene-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "RecommendCustomIndexingWithTika" by ShawnHeisey
Date Sat, 26 May 2018 16:01:28 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "RecommendCustomIndexingWithTika" page has been changed by ShawnHeisey:
https://wiki.apache.org/solr/RecommendCustomIndexingWithTika?action=diff&rev1=3&rev2=4

Comment:
Add paragraph about crash handling and the ability to utilize more Tika functionality.

  
  Rich document formats are frequently not well documented, and even in cases where there
IS documentation for the format, not everyone who creates documents will follow the specifications
faithfully.  This creates a situation where software like Tika may encounter something that
it is simply not able to handle gracefully.  Although the authors put a LOT of effort into
making sure the software runs well in unexpected situations, the reality is that sometimes
a document will cause the software to malfunction and even crash.
  
- If the Tika software included in Solr for SolrCell crashes, that means that Solr itself
is going to crash too.  That is why it is not recommended for production use.  The ExtractingRequestHandler
is a proof-of-concept tool that can get you started with parsing rich documents, but for production
we strongly recommend writing an external program that incorporates Tika and sends the discovered
data to Solr.
+ If the Tika software included in Solr for SolrCell crashes, that means that Solr itself
is going to crash too.  That is why it is not recommended for production use.  The ExtractingRequestHandler is a proof-of-concept
tool that can get you started with parsing rich documents, but for production we strongly
recommend writing an external program that incorporates Tika and sends the discovered data
to Solr.
+ 
+ If Tika processing is handled in a separate custom program, then any kind of malfunction
or crash can be handled gracefully and will not affect the operation of Solr.  With a custom
program, the full capability of Tika will be available.  The ExtractingRequestHandler is a
generic implementation that does not provide access to the full capability of Tika.  A custom
program is also capable of manipulating the index data in myriad ways.
  
  There is some [[https://lucidworks.com/2012/02/14/indexing-with-solrj/|example code]] on
the Lucidworks blog.
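
As a rough illustration of the "separate custom program" idea described above, here is a minimal Java sketch of an indexing loop that isolates per-document failures. It is not taken from the wiki page or from the linked blog post. The class name SafeIndexer, the extractor and sink parameters, and the timeout are all hypothetical: the extractor stands in for whatever Tika parsing call the program would make, and the sink stands in for whatever SolrJ call would send the result to Solr. Only the JDK is used, so the sketch runs on its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.BiConsumer;
import java.util.function.Function;

/** Hypothetical indexing loop: one bad document must not stop the rest. */
class SafeIndexer {
    private final Function<String, String> extractor; // assumption: wraps a Tika parse
    private final BiConsumer<String, String> sink;    // assumption: wraps a SolrJ add
    private final long timeoutMillis;
    private final List<String> failed = new ArrayList<>();
    private int indexed = 0;

    SafeIndexer(Function<String, String> extractor,
                BiConsumer<String, String> sink,
                long timeoutMillis) {
        this.extractor = extractor;
        this.sink = sink;
        this.timeoutMillis = timeoutMillis;
    }

    /** Extracts each document on a worker thread and records failures instead of dying. */
    void indexAll(List<String> docIds) {
        // Daemon threads, so a parser stuck in an endless loop cannot block JVM exit.
        ExecutorService pool = Executors.newCachedThreadPool(r -> {
            Thread t = new Thread(r);
            t.setDaemon(true);
            return t;
        });
        try {
            for (String id : docIds) {
                Callable<String> task = () -> extractor.apply(id);
                Future<String> future = pool.submit(task);
                try {
                    String text = future.get(timeoutMillis, TimeUnit.MILLISECONDS);
                    sink.accept(id, text);
                    indexed++;
                } catch (TimeoutException e) {
                    future.cancel(true); // parser hung: give up on this document
                    failed.add(id);
                } catch (ExecutionException e) {
                    failed.add(id);      // parser threw an exception or an Error
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    failed.add(id);
                    return;
                }
            }
        } finally {
            pool.shutdownNow();
        }
    }

    int getIndexed() { return indexed; }

    List<String> getFailed() { return failed; }

    public static void main(String[] args) {
        SafeIndexer indexer = new SafeIndexer(
            id -> {
                if (id.equals("broken.doc")) {
                    throw new IllegalStateException("simulated parser failure");
                }
                return "text of " + id;
            },
            (id, text) -> System.out.println("would send to Solr: " + id),
            5000);
        indexer.indexAll(Arrays.asList("a.pdf", "broken.doc", "b.docx"));
        System.out.println("indexed=" + indexer.getIndexed()
            + " failed=" + indexer.getFailed());
    }
}
```

A worker thread with a timeout only covers exceptions, errors and hangs inside the indexing program. A hard JVM crash would still end that program, but because it runs outside Solr, Solr keeps running and the program can be restarted and pointed at the remaining documents.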
  
