lucene-solr-commits mailing list archives

From Apache Wiki <>
Subject [Solr Wiki] Update of "ExtractingRequestHandler" by GrantIngersoll
Date Fri, 14 Nov 2008 18:36:47 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The following page has been changed by GrantIngersoll:

The comment on the change is:
save interim text

  The !ExtractingRequestHandler will provide a wrapper around Tika to allow users to upload
binary files to Solr and have Solr extract text from them and then index it.
- = Features =
+ = Getting Started =
+ * Check out Solr trunk
+ * Apply the patch: <!> TODO: PATCH NAME HERE <!>
+ * Add to your solr-trunk/lib
(the lib directory)
+ * ant clean example  // build the example
+ * cd example
+ * java -jar start.jar
+ In a separate window, post a file:
+ * curl http://localhost:8983/solr/update/extract?ext.idx.attr=true\&ext.def.fl=text -F "myfile=@tutorial.html"
+ or
+ * curl http://localhost:8983/solr/update/extract?ext.idx.attr=true\&ext.def.fl=text --data-binary @tutorial.html -H 'Content-type:text/html'
+        <!> NOTE: this streams the raw file contents, so Solr receives no information about
the name of the file. The !ExtractingRequestHandler will therefore auto-generate an ID for
the file, unless you specify one by adding a literal value (see below).
+ or use any other method you prefer for posting the file.
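The note above mentions supplying a literal value so that a streamed file gets a known ID. A minimal sketch of what that request might look like, assuming the schema's unique key field is named `id` and using the arbitrary value `tutorial-1` (both are assumptions for illustration; the `ext.literal.<NAME>` parameter itself is described under Input Parameters):

```shell
#!/bin/sh
# Build the extract URL for a streamed file, adding a literal ID.
# Assumptions: the unique key field is called "id"; "tutorial-1" is an arbitrary value.
base='http://localhost:8983/solr/update/extract'
params='ext.idx.attr=true&ext.def.fl=text&ext.literal.id=tutorial-1'
url="$base?$params"

# Print the command rather than running it; quoting the URL avoids having to escape "&".
echo "curl '$url' --data-binary @tutorial.html -H 'Content-type:text/html'"
```

Copy the printed command into a shell to send the request once the example server is running.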
+ = Input Parameters =
+ *<Tika Metadata Attribute> = Solr Field Name - Map a Tika metadata attribute
to a Solr field name.  If no mapping is specified, the metadata attribute will be used as
the field name.  If that field does not exist in the schema, it can be ignored by setting the
"ignore undeclared fields" (ext.ignore.und.fl) parameter described below.
+ * ext.boost.<NAME> = Float -  Boost the field with the specified name.  The NAME value
is the name of the Solr field (not the Tika metadata name). 
+ * ext.literal.<NAME> = <VALUE> - Create a field on the document with field name
NAME and literal value VALUE, e.g.
+ * ext.ignore.und.fl = true|false - Default is false.  If true, ignore fields that are extracted
but are not in the Solr Schema.  Otherwise, an exception will be thrown for fields that are
not mapped.
+ * ext.xpath = <XPath expression> - When extracting, only return Tika XHTML content
that satisfies the XPath expression.  See
for details on the format of Tika XHTML.  See also TikaExtractOnlyExampleOutput.
+ * ext.extract.only = true|false - Default is false.  If true, return the extracted content
from Tika without indexing the document.  This literally includes the extracted XHTML as a
<str> in the response.  See TikaExtractOnlyExampleOutput.
+ * ext.idx.attr = true|false - Index the Tika XHTML attributes into separate fields, named
after the attribute.  For example, when extracting from HTML, Tika can return the href values
of <a> tags as attributes of a tag name.  See the examples below.
+ * ext.def.fl = <NAME> - The name of the field to add the default content to.  See
also ext.capture below.  This NAME is not mapped, but it can be boosted.
+ * ext.capture = <Tika XHTML NAME> - Capture Tika XHTML elements with this name separately,
so that they can be added to the Solr document.
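Several of the parameters above can be combined in one request. The following is a rough sketch, not a tested recipe: the `query` helper is hypothetical, the boost value `2.0` is arbitrary, and the field name `text` is taken from the Getting Started commands. It only assembles and prints the command, and it assumes none of the values need URL encoding.

```shell
#!/bin/sh
# Hypothetical helper: join key=value arguments with "&" to form a query string.
# No URL encoding is done, so values containing spaces or "&" would need extra handling.
query() {
  out=''
  for kv in "$@"; do
    out="${out:+$out&}$kv"
  done
  printf '%s\n' "$out"
}

# Default content goes to "text", that field is boosted, and undeclared fields are ignored.
qs=$(query 'ext.def.fl=text' 'ext.boost.text=2.0' 'ext.ignore.und.fl=true')
url="http://localhost:8983/solr/update/extract?$qs"

# Print the command rather than running it.
echo "curl '$url' -F 'myfile=@tutorial.html'"
```

Setting `ext.ignore.und.fl=true` here is a convenience for experimenting; with the default of false, fields that are not in the schema cause an exception, as described above.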
+ = Examples =
+ = Implementation Details =
- = Summary of Input Parameters =
+ == Customizing ==
+ While the current !ExtractingRequestHandler only allows for the use of the !SolrContentHandler
in creating new documents, it is relatively easy to implement your own extension that processes
the Tika extracted content differently and produces a different !SolrInputDocument.
+ To do this, implement your own instance of the !SolrContentHandlerFactory and override the
createFactory() method on the !ExtractingRequestHandler.
