lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "ExtractingRequestHandler" by ChrisHarris
Date Fri, 21 Nov 2008 22:56:35 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The following page has been changed by ChrisHarris:
http://wiki.apache.org/solr/ExtractingRequestHandler

The comment on the change is:
Note existence of stream.type

------------------------------------------------------------------------------
  
  Before getting started, there are a few concepts that are helpful to understand.
  
+  * Tika will automatically attempt to determine the input document type (Word, PDF, etc.)
and extract the content appropriately. If you want, you can explicitly specify a MIME type
for Tika with the stream.type parameter.
   * Tika does everything by producing an XHTML stream that it feeds to a SAX !ContentHandler.
   * Solr then implements a !SolrContentHandler which reacts to Tika's SAX events and creates
a !SolrInputDocument.  You can override the !SolrContentHandler.  See the section below on
Customization.
   * Tika produces Metadata information according to things like !DublinCore and other specifications.
 See the Tika javadocs on the Metadata class for what gets produced.  <!> TODO: Link
to Tika Javadocs <!>  See also http://incubator.apache.org/tika/formats.html
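
As a rough illustration of the stream.type parameter mentioned in the list above, the sketch below builds a request URL that carries an explicit MIME type. Only stream.type comes from this page; the host, the handler path, the stream.file parameter, and the file path are assumptions made for the example and may not match your setup.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, file_path, mime_type=None):
    """Build a query URL for an extraction request.

    base_url and the stream.file parameter are assumptions for this sketch;
    stream.type is the parameter described on this page.
    """
    params = {"stream.file": file_path}
    if mime_type is not None:
        # Skip Tika's automatic type detection by naming the MIME type.
        params["stream.type"] = mime_type
    return base_url + "?" + urlencode(params)


# Hypothetical host, handler path, and file location.
url = build_extract_url(
    "http://localhost:8983/solr/update/extract",
    "/tmp/report.pdf",
    mime_type="application/pdf",
)
```

Leaving mime_type unset omits stream.type entirely, which corresponds to letting Tika detect the document type on its own.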
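
The flow described above (an XHTML stream fed to a SAX content handler, which then builds a document) can be sketched with Python's standard xml.sax module. This is only an analogy to make the event-driven idea concrete, not Solr's or Tika's actual code: FieldCollectingHandler, extract_fields, and the sample XHTML are hypothetical stand-ins for a SolrContentHandler and Tika's output.

```python
import xml.sax


class FieldCollectingHandler(xml.sax.ContentHandler):
    """Hypothetical stand-in for a SolrContentHandler: SAX events in, fields out."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": "", "body": ""}
        self._open_elements = []  # names of the elements currently open

    def startElement(self, name, attrs):
        self._open_elements.append(name)

    def endElement(self, name):
        self._open_elements.pop()

    def characters(self, content):
        # Route text to a field based on which element encloses it.
        if "title" in self._open_elements:
            self.fields["title"] += content
        elif "body" in self._open_elements:
            self.fields["body"] += content


def extract_fields(xhtml):
    """Parse XHTML bytes and return whitespace-normalized field values."""
    handler = FieldCollectingHandler()
    xml.sax.parseString(xhtml, handler)
    return {name: " ".join(text.split()) for name, text in handler.fields.items()}


# Made-up XHTML resembling what an extractor might emit for a small document.
SAMPLE_XHTML = b"""<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Quarterly Report</title></head>
  <body>
    <p>Revenue grew.</p>
    <p>Costs fell.</p>
  </body>
</html>"""

document = extract_fields(SAMPLE_XHTML)
```

Overriding the handler, as the Customization section describes for !SolrContentHandler, would amount to changing how these events are mapped to fields.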
