lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "ExtractingRequestHandler" by GrantIngersoll
Date Sat, 15 Nov 2008 13:20:33 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The following page has been changed by GrantIngersoll:
http://wiki.apache.org/solr/ExtractingRequestHandler

------------------------------------------------------------------------------
  A common need of users is the ability to ingest binary and/or structured documents such
as Office, PDF and other proprietary formats.  The [http://www.lucene.apache.org/tika Apache
Tika] project provides a framework for wrapping many different file format parsers, such as
PDFBox, POI and others.
  
  Solr's !ExtractingRequestHandler provides a wrapper around Tika to allow users to upload binary files to Solr and have Solr extract text from them and then index it.
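+ 
+ As a quick orientation, a single curl request is enough to send a binary file to the handler.  The sketch below simply mirrors the parameters used in the examples later on this page and assumes the example Solr instance on port 8983 with a schema that has a text field to serve as the default field; adjust the URL and field names for your setup.
+ 
+ {{{
+ # Hypothetical minimal request: post tutorial.pdf and index its extracted text into the default field "text"
+ curl http://localhost:8983/solr/update/extract?ext.idx.attr=true\&ext.def.fl=text -F "tutorial=@tutorial.pdf"
+ }}}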
+ 
+ = Concepts =
+ 
+ Before getting started, there are a couple of concepts that are helpful to understand.
+ 
+  * Tika does everything by producing an XHTML stream that it feeds to a SAX !ContentHandler.
+  * Solr then implements a !SolrContentHandler which reacts to Tika's SAX events and creates a !SolrInputDocument.  You can override the !SolrContentHandler; see the Customizing section below.
+  * Tika produces Metadata information according to specifications such as !DublinCore.  See the Tika javadocs on the Metadata class for what gets produced.  <!> TODO: Link to Tika Javadocs <!>  See also http://incubator.apache.org/tika/formats.html
+  * We can map Tika's metadata fields to Solr fields, and we can boost these fields.
+  * We can also pass in literals.
+  * We can apply an XPath expression to the Tika XHTML by passing in the ext.xpath parameter (described below).  This restricts the events that are given to the !SolrContentHandler; it is still up to the !SolrContentHandler to process those events.  See the sketch after this list.
+  * Field boosts are applied after name mapping.
+  * When specifying parameters, it is useful to keep in mind what input a given operation works on.  For instance, capture fields are given to the !SolrContentHandler to capture content from the Tika XHTML, so the field names are the XHTML element names, not the mapped names.
+  * A default field name is required for indexing, but not for extraction only.
+  * The default field name and any literal values are not mapped.  They can be boosted. 
See the examples.
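+ 
+ For illustration, a hypothetical request using ext.xpath might look like the following.  This is a sketch rather than an example taken from this page: the XPath expression assumes Tika's XHTML output (html/body/div) and Tika's restricted XPath syntax, and the URL is quoted instead of using \& escapes so the parentheses are not interpreted by the shell.
+ 
+ {{{
+ # Hypothetical: only SAX events under <div> elements of the Tika XHTML reach the SolrContentHandler
+ curl "http://localhost:8983/solr/update/extract?ext.def.fl=text&ext.xpath=/xhtml:html/xhtml:body/xhtml:div/descendant:node()" -F "tutorial=@tutorial.pdf"
+ }}}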
  
  = Getting Started =
  
@@ -81, +96 @@

  
  = Examples =
  
- = Implementation Details =
+ == Mapping and Capture ==
+ 
+ Capture <div> tags separately, and then map that field to a dynamic field named foo_t.
+ 
+ {{{
+ curl http://localhost:8983/solr/update/extract?ext.idx.attr=true\&ext.def.fl=text\&ext.map.div=foo_t\&ext.capture=div -F "tutorial=@tutorial.pdf"
+ }}}
+ 
+ == Mapping, Capture and Boost ==
+ Capture <div> tags separately, map that field to a dynamic field named foo_t, and boost foo_t by 3.
+ {{{
+ curl http://localhost:8983/solr/update/extract?ext.idx.attr=true\&ext.def.fl=text\&ext.map.div=foo_t\&ext.capture=div\&ext.boost.foo_t=3 -F "tutorial=@tutorial.pdf"
+ }}}
+ 
+ == Literals ==
+ 
+ Pass in some literal information along with the file:
+ {{{
+ curl http://localhost:8983/solr/update/extract?ext.idx.attr=true\&ext.def.fl=text\&ext.map.div=foo_t\&ext.capture=div\&ext.boost.foo_t=3\&ext.literal.blah_i=1 -F "tutorial=@tutorial.pdf"
+ }}}
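+ 
+ The Concepts section above notes that literal values can be boosted.  A hypothetical variant of the request above that also boosts the blah_i literal is sketched below; it assumes the ext.boost.<fieldname> pattern shown for foo_t applies to the literal field as well.
+ 
+ {{{
+ # Hypothetical: boost the literal field blah_i by 2 in addition to boosting foo_t by 3
+ curl http://localhost:8983/solr/update/extract?ext.idx.attr=true\&ext.def.fl=text\&ext.map.div=foo_t\&ext.capture=div\&ext.boost.foo_t=3\&ext.literal.blah_i=1\&ext.boost.blah_i=2 -F "tutorial=@tutorial.pdf"
+ }}}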
+ 
+ == Extract Only ==
+ 
+ Extract the content without indexing it:
+ {{{
+ curl http://localhost:8983/solr/update/extract?ext.idx.attr=true\&ext.extract.only=true --data-binary @tutorial.html -H 'Content-type:text/html'
+ }}}
+ 
+ See TikaExtractOnlyExampleOutput.
  
  
  == Customizing ==
