lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "ExtractingRequestHandler" by HossMan
Date Tue, 05 Jun 2012 22:47:06 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "ExtractingRequestHandler" page has been changed by HossMan:
http://wiki.apache.org/solr/ExtractingRequestHandler?action=diff&rev1=73&rev2=74

Comment:
fill in some TODOs and clean up some formatting

   1. if {{{uprefix}}} is specified, any unknown field names are prefixed with that value,
else if {{{defaultField}}} is specified, unknown fields are copied to that.
  
  = Configuration =
- // TODO: this is out of date as of Solr 1.4 - dist/apache-solr-cell-1.4.jar and all of contrib/extraction/lib
are needed
  
- If you are not working from the supplied example/solr directory you must copy all libraries
from example/solr/libs into a libs directory within your own solr directory. The !ExtractingRequestHandler
is not incorporated into the solr war file, you have to install it separately.
+ The !ExtractingRequestHandler is not incorporated into the solr war file; it is provided
as a SolrPlugin, and you have to load it (and its dependencies) explicitly.
  
- Example config:
+ Example configuration for loading plugin and dependencies:
+ 
+ {{{
+   <lib dir="../../dist/" regex="apache-solr-cell-\d.*\.jar" />
+   <lib dir="../../contrib/extraction/lib" regex=".*\.jar" />
+ }}}
+ 
+ 
+ Example configuration for the Handler:
  
  {{{
  <requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
@@ -101, +108 @@

      </lst>
    </requestHandler>
  }}}
+ 
  In the defaults section, we are mapping Tika's Last-Modified Metadata attribute to a field
named last_modified.  We are also telling it to ignore undeclared fields.  Because these are
defaults, each one can be overridden by request parameters.
  
  The tika.config entry points to a file containing a Tika configuration.  You would only
need this if you have customized your own Tika configuration.  The Tika config contains info
about parsers, mime types, etc.
@@ -184, +192 @@

  See TikaExtractOnlyExampleOutput.
  
  = Sending documents to Solr =
- // TODO: describe the different ways to send the documents to solr (POST body, form encoded,
remoteStreaming)
  
+ The ExtractingRequestHandler can process any document sent as a ContentStream ...
+  * Raw POST
+  * Multi-part file upload (each file is processed as a distinct document)
+  * "stream.body", "stream.url" and "stream.file" request params.
+ 
+ Example...
+ 
+ {{{
-  * curl "http://localhost:8983/solr/update/extract?literal.id=doc5&defaultField=text"
 --data-binary @tutorial.html  -H 'Content-type:text/html'
+ curl "http://localhost:8983/solr/update/extract?literal.id=doc5&defaultField=text" 
--data-binary @tutorial.html  -H 'Content-type:text/html'
+ }}}
+ 
-   . <!> NOTE, this literally streams the file as the body of the POST, which does
not, then, provide info to Solr about the name of the file.
+ <!> NOTE, this literally streams the file as the body of the POST, which does not,
then, provide info to Solr about the name of the file.
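
The three ways of sending a document listed above can be sketched side by side. This is a minimal dry-run sketch, not part of the wiki page: it only prints the curl commands rather than running them. The helper function names, the document ids doc6 and doc7, the form field name "myfile" and the path /path/to/tutorial.html are placeholders chosen for illustration. It also assumes that "stream.file" names a path readable by the Solr server and that remote streaming is enabled in solrconfig.xml.

```shell
#!/bin/sh
# Dry run: prints curl commands instead of executing them, so no live Solr is needed.
SOLR_URL="http://localhost:8983/solr/update/extract"   # handler URL from the example above

# Build the handler URL for a document id plus any extra query parameters.
extract_url() {
  id="$1"; shift
  url="${SOLR_URL}?literal.id=${id}"
  for param in "$@"; do
    url="${url}&${param}"
  done
  printf '%s\n' "$url"
}

# 1. Raw POST: the file is the request body, so no file name reaches Solr.
raw_post_cmd() {
  printf "curl \"%s\" --data-binary @%s -H 'Content-type:%s'\n" \
    "$(extract_url "$1" defaultField=text)" "$2" "$3"
}

# 2. Multi-part file upload: "myfile" is an arbitrary form field name (placeholder).
multipart_cmd() {
  printf 'curl "%s" -F "myfile=@%s"\n' "$(extract_url "$1" defaultField=text)" "$2"
}

# 3. stream.file: Solr reads the file itself (assumes remote streaming is enabled).
stream_file_cmd() {
  printf 'curl "%s"\n' "$(extract_url "$1" defaultField=text "stream.file=$2")"
}

raw_post_cmd doc5 tutorial.html text/html
multipart_cmd doc6 tutorial.html
stream_file_cmd doc7 /path/to/tutorial.html
```

Copying a printed line into a terminal sends the request. A real path passed to "stream.file" may need URL-encoding, which this sketch does not do.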
  
  == SolrJ ==
  Use the !ContentStreamUpdateRequest (see ContentStreamUpdateRequestExample for a full example):
@@ -225, +242 @@

   * Commit
  
  = Additional Resources =
- * [[http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#example.source|Lucid
Imagination article]] * [[http://tika.apache.org/0.10/formats.html|Supported document formats
via Tika (0.10)]]
+  * [[http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#example.source|Lucid
Imagination article]] 
+  * [[http://tika.apache.org/0.10/formats.html|Supported document formats via Tika (0.10)]]
  
  = What's in a Name =
  Grant was writing the javadocs for the code and needed an entry for the <title> tag
and wrote out "Solr Content Extraction Library", since the contrib directory is named "extraction".
 This then led to an "acronym":  Solr CEL which then gets mashed to: Solr Cell.  Hence, the
project name is "Solr Cell".  It's also appropriate because a Solar Cell's job is to convert
the raw energy of the Sun to electricity, and this contrib's module is responsible for converting
the "raw" content of a document to something usable by Solr. http://en.wikipedia.org/wiki/Solar_cell
