lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "ExtractingRequestHandler" by alexandersulz
Date Tue, 28 Sep 2010 09:55:53 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "ExtractingRequestHandler" page has been changed by alexandersulz.
http://wiki.apache.org/solr/ExtractingRequestHandler?action=diff&rev1=62&rev2=63

--------------------------------------------------

  <<TableOfContents>>
  
  = Introduction =
- 
  <!> [[Solr1.4]]
  
  A common need of users is the ability to ingest binary and/or structured documents such
as Office, Word, PDF and other proprietary formats.  The [[http://incubator.apache.org/tika/|Apache
Tika]] project provides a framework for wrapping many different file format parsers, such
as PDFBox, POI and others.
@@ -13, +12 @@

  Solr's !ExtractingRequestHandler uses Tika to allow users to upload binary files to Solr and have Solr extract the text from them and then index it.
  
  = Concepts =
- 
  Before getting started, there are a few concepts that are helpful to understand.
  
   * Tika will automatically attempt to determine the input document type (Word, PDF, etc.) and extract the content appropriately. If you want, you can explicitly specify a MIME type for Tika with the stream.type parameter; see the sketch after this list.
@@ -26, +24 @@

   * We can apply an XPath expression to the Tika XHTML to restrict the content that is produced.
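  For instance, a minimal sketch of forcing the MIME type with the stream.type parameter named above (the id and file name are placeholders; with remote streaming you can instead pass stream.contentType, as shown later on this page):

  {{{
  curl "http://localhost:8983/solr/update/extract?literal.id=doc0&stream.type=text/html&commit=true" -F "myfile=@tutorial.html"
  }}}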
  
  = Getting Started with the Solr Example =
-  * Check out Solr trunk or get a 1.4 release or later.  
+  * Check out Solr trunk or get a 1.4 release or later.
   * Note: if using a check out of the Solr source code instead of a binary release, running
"ant example" will build the necessary jars.
+ 
  Now start the solr example server:
+ 
  {{{
  cd example
  java -jar start.jar
  }}}
- 
  In a separate window go to the {{{docs/}}} directory (which contains some nice example docs),
or the {{{site}}} directory if you built Solr from source, and send Solr a file via HTTP POST:
+ 
  {{{
  cd site
  curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true' -F "myfile=@tutorial.html"
  }}}
-  * Note, the /site directory in the solr download contains some nice example docs to try

+  * Note, the /site directory in the solr download contains some nice example docs to try
   * Hint: myfile=@tutorial.html needs a valid path (absolute or relative), e.g. "myfile=@../../site/tutorial.html" if you are still in the exampledocs dir.
   * the {{{literal.id=doc1}}} param provides the necessary unique id for the document being
indexed
   * the {{{commit=true}}} param causes Solr to do a commit after indexing the document, making it immediately searchable.  For good performance when loading many documents, don't call commit until you are done; see the sketch after this list.
   * using "curl" or other command line tools to post documents to Solr is nice for testing,
but not the recommended update method for best performance.
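  For example, when loading several documents you might commit only once, on the last request (a sketch; the ids and file names are placeholders):

  {{{
  curl 'http://localhost:8983/solr/update/extract?literal.id=doc1' -F "myfile=@tutorial.html"
  curl 'http://localhost:8983/solr/update/extract?literal.id=doc2&commit=true' -F "myfile=@index.html"
  }}}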
  
- Now, you should be able to execute a query and find that document (open the following link
in your browser):
+ Now, you should be able to execute a query and find that document (open the following link
in your browser): http://localhost:8983/solr/select?q=tutorial
- http://localhost:8983/solr/select?q=tutorial
  
  You may notice that although you can search on any of the text in the sample document, you may not be able to see that text when the document is retrieved.  This is simply because the "content" field generated by Tika is mapped to the Solr field called "text", which is indexed but not stored. This is done via the default map rule in the {{{/update/extract}}} handler in {{{solrconfig.xml}}} and can be easily changed or overridden. For example, to store and see all metadata and content, execute the following:
+ 
  {{{
- curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true'
-F "myfile=@tutorial.html"
+ curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true"
-F "myfile=@tutorial.html"
  }}}
   * The {{{uprefix=attr_}}} param causes all generated fields that aren't defined in the
schema to be prefixed with attr_ (which is a dynamic field that is stored).
   * The {{{fmap.content=attr_content}}} param overrides the default {{{fmap.content=text}}}
causing the content to be added to the attr_content field instead.
  
-  
  And then query via http://localhost:8983/solr/select?q=attr_content:tutorial
  
  = Input Parameters =
@@ -75, +74 @@

  
   * extractFormat=xml|text - Default is xml.  Controls the serialization format of the extracted content.  The xml format is actually XHTML, like passing the -x flag to the tika command line application, while text is like the -t flag.  See [[https://issues.apache.org/jira/browse/SOLR-1274|SOLR-1274]] and the sketch below.
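
  For example, to get plain text rather than XHTML back from an extract-only request (a quick sketch):

  {{{
  curl "http://localhost:8983/solr/update/extract?extractOnly=true&extractFormat=text" --data-binary @tutorial.html -H 'Content-type:text/html'
  }}}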
  
- 
  == Order of field operations ==
   1. fields are generated by Tika or passed in as literals via {{{literal.fieldname=value}}}
   1. if lowernames==true, fields are mapped to lower case
@@ -86, +84 @@

  // TODO: this is out of date as of Solr 1.4 - dist/apache-solr-cell-1.4.jar and all of contrib/extraction/lib
are needed
  
  If you are not working from the supplied example/solr directory you must copy all libraries from example/solr/libs into a libs directory within your own solr directory. The !ExtractingRequestHandler is not incorporated into the solr war file; you have to install it separately.
- 
  
  Example config:
  
@@ -104, +101 @@

      </lst>
    </requestHandler>
  }}}
- 
  In the defaults section, we are mapping Tika's Last-Modified Metadata attribute to a field named last_modified.  We are also telling it to ignore undeclared fields.  These parameters can all be overridden at request time.
  
  The tika.config entry points to a file containing a Tika configuration.  You would only
need this if you have customized your own Tika configuration.  The Tika config contains info
about parsers, mime types, etc.
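
  For instance, the entry sits directly inside the request handler declaration (a sketch; the path is a placeholder):

  {{{
    <requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
      <str name="tika.config">/my/path/to/tika.config</str>
      ...
    </requestHandler>
  }}}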
  
- You may also need to adjust the {{{multipartUploadLimitInKB}}} attribute as follows if you
are submitting very large documents. 
+ You may also need to adjust the {{{multipartUploadLimitInKB}}} attribute as follows if you
are submitting very large documents.
+ 
  {{{
    <requestDispatcher handleSelect="true" >
      <requestParsers enableRemoteStreaming="{true|false}" multipartUploadLimitInKB="2048000"
/>
      ....
  }}}
- 
  To use remote streaming, you must first enable it. See ContentStream for more info, or just set enableRemoteStreaming=true in the snippet above.  As an example of using remote streaming, you can do:
+ 
  {{{
   curl "http://localhost:8983/solr/update/extract?stream.file=/path/to/file/StatesLeftToVisit.doc&stream.contentType=application/msword&literal.id=states.doc"
  }}}
- 
- 
  Lastly, the date.formats parameter allows you to specify various java.text.SimpleDateFormat patterns for transforming extracted input into a Date.  Solr comes configured with the following date formats (see the DateUtil class in Solr):
+ 
  {{{
  yyyy-MM-dd'T'HH:mm:ss'Z'
  yyyy-MM-dd'T'HH:mm:ss
@@ -134, +130 @@

  EEEE, dd-MMM-yy HH:mm:ss zzz
  EEE MMM d HH:mm:ss yyyy
  }}}
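
  To add your own pattern, the list can be declared on the handler itself (a sketch modeled on the stock example solrconfig.xml):

  {{{
    <requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
      ...
      <lst name="date.formats">
        <str>yyyy-MM-dd</str>
      </lst>
    </requestHandler>
  }}}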
- 
  == MultiCore config ==
   * For multi-core, specify {{{ sharedLib='lib' }}} in {{{ <solr /> }}} in example/solr/solr.xml
in order for Solr to find the jars in example/solr/lib
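
  For example, a sketch of the relevant attribute in example/solr/solr.xml (the core names are placeholders):

  {{{
  <solr persistent="false" sharedLib="lib">
    <cores adminPath="/admin/cores">
      <core name="core0" instanceDir="core0" />
    </cores>
  </solr>
  }}}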
  
- 
  = Metadata =
- 
  As has been implied up to now, Tika produces Metadata about the document.  Metadata often contains things like the author of the file or the number of pages.  The Metadata produced depends on the type of document submitted.  For instance, PDFs have different metadata from Word docs.
  
  In addition to Tika's metadata, Solr adds the following metadata (defined in !ExtractingMetadataConstants):
+ 
   * "stream_name" - The name of the !ContentStream as uploaded to Solr.  Depending on how
the file is uploaded, this may or may not be set.
-  * "stream_source_info" - Any source info about the stream.  See !ContentStream.  
+  * "stream_source_info" - Any source info about the stream.  See !ContentStream.
   * "stream_size" - The size of the stream in bytes(?)
   * "stream_content_type" - The content type of the stream, if available.
  
  It is highly recommended that you try using the extract-only option to see what values actually get set for these.
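
  For example, uploading via stream.file populates several of these in the extract-only response (a sketch; remote streaming must be enabled as described above, and the path is a placeholder):

  {{{
  curl "http://localhost:8983/solr/update/extract?extractOnly=true&stream.file=/path/to/tutorial.html&stream.contentType=text/html"
  }}}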
  
  = Examples =
- 
  == Mapping and Capture ==
- 
  Capture <div> tags separately, and then map that field to a dynamic field named foo_t.
  
  {{{
   curl "http://localhost:8983/solr/update/extract?literal.id=doc2&captureAttr=true&defaultField=text&fmap.div=foo_t&capture=div"
 -F "tutorial=@tutorial.pdf"
  }}}
- 
  == Mapping, Capture and Boost ==
  Capture <div> tags separately, and then map that field to a dynamic field named foo_t.  Boost foo_t by 3.
+ 
  {{{
  curl "http://localhost:8983/solr/update/extract?literal.id=doc3&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3"
-F "tutorial=@tutorial.pdf"
  }}}
- 
  == Literals ==
- 
  To add in your own metadata, pass in the literal parameter along with the file:
+ 
  {{{
  curl "http://localhost:8983/solr/update/extract?literal.id=doc4&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3&literal.blah_s=Bah"
 -F "tutorial=@tutorial.pdf"
  }}}
- 
  == XPath ==
- 
  Restrict the XHTML returned by Tika by passing in an XPath expression:
  
  {{{
  curl "http://localhost:8983/solr/update/extract?literal.id=doc5&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3&literal.id=id&xpath=/xhtml:html/xhtml:body/xhtml:div/descendant:node()"
 -F "tutorial=@tutorial.pdf"
  }}}
- 
  == Extract Only ==
  {{{
  curl "http://localhost:8983/solr/update/extract?&extractOnly=true"  --data-binary @tutorial.html
 -H 'Content-type:text/html'
  }}}
- 
  The output includes the XML generated by Tika (which is therefore further escaped inside Solr's XML response).  Using a different output format can enhance readability:
+ 
  {{{
  curl "http://localhost:8983/solr/update/extract?&extractOnly=true&wt=ruby&indent=true"
 --data-binary @tutorial.html  -H 'Content-type:text/html'
  }}}
- 
  See TikaExtractOnlyExampleOutput.
  
  = Sending documents to Solr =
- 
  // TODO: describe the different ways to send the documents to solr (POST body, form encoded,
remoteStreaming)
+ 
-  * curl "http://localhost:8983/solr/update/extract?literal.id=doc5&defaultField=text"
 --data-binary @tutorial.html  -H 'Content-type:text/html'  
+  * curl "http://localhost:8983/solr/update/extract?literal.id=doc5&defaultField=text"
 --data-binary @tutorial.html  -H 'Content-type:text/html'
-        <!> NOTE, this literally streams the file as the body of the POST, which does
not, then, provide info to Solr about the name of the file.
+   . <!> NOTE, this literally streams the file as the body of the POST, which does
not, then, provide info to Solr about the name of the file.
  
  == SolrJ ==
  Use the !ContentStreamUpdateRequest (see ContentStreamUpdateRequestExample for a full example):
+ 
  {{{#!java
  ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
  up.addFile(new File("mailing_lists.pdf"));
@@ -212, +200 @@

  rsp = server.query( new SolrQuery( "*:*") );
  Assert.assertEquals( 1, rsp.getResults().getNumFound() );
  }}}
- 
  If you want to set a '''multiValued''' field, use the ''ModifiableSolrParams'' class like
this:
  
  {{{#!java
@@ -222, +209 @@

  }
  up.setParams(p);
  }}}
- 
  You could also set all of the other literals and parameters in this class, and then use
the ''setParams'' method to apply the changes to your content stream.
  
  = Committer Notes =
- 
  == Upgrading Tika ==
- 
   * Get the Tika version to upgrade, unpack it, and switch into that directory.
   * mvn install
   * mvn dependency:copy-dependencies
@@ -237, +221 @@

   * Commit
  
  = Additional Resources =
- * [[http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#example.source|Lucid Imagination article]]
+ * [[http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#example.source|Lucid Imagination article]]
+ * [[http://tika.apache.org/0.7/formats.html|Supported document formats via Tika (0.7)]]
- * [[http://tika.apache.org/0.7/formats.html|Supported document formats via Tika (0.7)]]
  
  = What's in a Name =
- 
  Grant was writing the javadocs for the code and needed an entry for the <title> tag and wrote out "Solr Content Extraction Library", since the contrib directory is named "extraction".  This then led to an "acronym":  Solr CEL, which then gets mashed to: Solr Cell.  Hence, the project name is "Solr Cell".  It's also appropriate because a Solar Cell's job is to convert the raw energy of the Sun to electricity, and this contrib module is responsible for converting the "raw" content of a document to something usable by Solr. http://en.wikipedia.org/wiki/Solar_cell
  
