lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "ExtractingRequestHandler" by YonikSeeley
Date Tue, 14 Jul 2009 20:39:33 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The following page has been changed by YonikSeeley:
http://wiki.apache.org/solr/ExtractingRequestHandler

The comment on the change is:
snapshot - updating to reflect committed code, simplifying

------------------------------------------------------------------------------
  
  [[TableOfContents]]
  
- Please see [https://issues.apache.org/jira/browse/SOLR-284 SOLR-284] for more information
on the incorporation of this feature into Solr 1.4.
- 
  = Introduction =
  
- A common need of users is the ability to ingest binary and/or structured documents such
as Office, PDF and other proprietary formats.  The [http://incubator.apache.org/tika/ Apache
Tika] project provides a framework for wrapping many different file format parsers, such as
PDFBox, POI and others.
+ <!> ["Solr1.4"]
  
+ A common need of users is the ability to ingest binary and/or structured documents such
as Office, Word, PDF and other proprietary formats.  The [http://incubator.apache.org/tika/
Apache Tika] project provides a framework for wrapping many different file format parsers,
such as PDFBox, POI and others.
+ 
- Solr's !ExtractingRequestHandler provides a wrapper around Tika to allow users to upload
binary files to Solr and have Solr extract text from it and then index it.
+ Solr's !ExtractingRequestHandler uses Tika to allow users to upload binary files to Solr
and have Solr extract text from it and then index it.
  
  = Concepts =
  
@@ -18, +18 @@

  
   * Tika will automatically attempt to determine the input document type (word, pdf, etc.) and extract the content appropriately. If you want, you can explicitly specify a MIME type for Tika with the stream.type parameter.
   * Tika does everything by producing an XHTML stream that it feeds to a SAX !ContentHandler.
-  * Solr then implements a !SolrContentHandler which reacts to Tika's SAX events and creates
a !SolrInputDocument.  You can override the !SolrContentHandler.  See the section below on
Customization.
-  * Tika produces Metadata information according to things like !DublinCore and other specifications.
 See the Tika javadocs on the Metadata class for what gets produced.  <!> TODO: Link
to Tika Javadocs <!>  See also http://lucene.apache.org/tika/formats.html
+  * Solr then reacts to Tika's SAX events and creates the fields to index.
+  * Tika produces Metadata information such as Title, Subject, and Author, according to specifications
like !DublinCore.  See http://lucene.apache.org/tika/formats.html for the file types supported.
+  * All of the extracted text is added to the "content" field
   * We can map Tika's metadata fields to Solr fields.  We can boost these fields
-  * We can also pass in literals.
+  * We can also pass in literals for field values.
+  * We can apply an XPath expression to the Tika XHTML to restrict the content that is produced.
-  * We can apply an XPath expression to the Tika XHTML by passing in the ext.xpath parameter
(described below).  This restricts down the events that are given to the !SolrContentHandler.
 It is still up to the !SolrContentHandler to process those events.
-  * Field boosts are applied after name mapping
-  * It is useful to keep in mind what a given operation is using for input when specifying
parameters.  For instance, captured fields are specified to the !SolrContentHandler for capturing
content in the Tika XHTML.  Thus, the names of the fields are those of the XHTML, not the
mapped names.
-  * A default field name is required for indexing, but not for extraction only.
-  * The default field name and any literal values are not mapped.  They can be boosted. 
See the examples.
- 
- 
- == When To Use ==
- 
- The !ExtractingRequestHandler can be used any time you have the need to index both the metadata
and text of binary documents like Word, PDF, etc.  It doesn't, however, make sense to use
it if you are only interested in indexing the metadata about documents, since it will be much
faster to determine the metadata on the client side and then send that as a normal Solr document.
 In fact, it might make sense for someone to write a piece for SolrJ that uses Tika on the
client-side to construct Solr documents.
  
  = Getting Started with the Solr Example =
+  * Check out Solr trunk or get a 1.4 release or later.
+  * If using a checkout, running "ant example" will build the necessary jars.
+ Now start the Solr example server:
+ {{{
+ cd example
+ java -jar start.jar
+ }}}
  
-  * Check out Solr trunk or get a 1.4 release or later if it exists.  
-  * If using a check out, running "ant example" will build the necessary jars.
-  * cd example
-  * The example directory comes with all required libs, but the configuration files are not
setup for the !ExtractingRequestHandler. Add the Configuration as defined below to the example's
solrconfig.xml.
-   *''recent versions of the solr code from svn, do contain a configuration section within
example/solr/conf/solrconfig.xml but it needs uncommented.''
-  * java -jar start.jar
-  * For multi-core, specify {{{ sharedLib='lib' }}} in {{{ <solr /> }}} in example/solr/solr.xml
in order for Solr to find the jars in example/solr/lib
+ In a separate window go to the {{{site/}}} directory (which contains some nice example docs)
and send Solr a file via HTTP POST:
+ {{{
+ cd site
+ curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true' -F "myfile=@tutorial.html"
+ }}}
+  * hint: myfile=@tutorial.html needs a valid path (absolute or relative), e.g. "myfile=@../../site/tutorial.html" if you are still in the exampledocs dir.
+  * the {{{literal.id=doc1}}} param provides the necessary unique id for the document being
indexed
+  * the {{{commit=true}}} param causes Solr to do a commit after indexing the document, making
it immediately searchable.  For good performance when loading many documents, don't call commit
until you are done.
+  * using "curl" or other command-line tools to post documents to Solr is convenient for testing, but it is not the recommended update method for best performance.
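The loading pattern described in the bullets above (index many documents, then commit once at the end) can be sketched as a small shell loop.  The file names and id scheme here are hypothetical, and it assumes the example server from this page is running on localhost:8983:

```shell
# Hypothetical batch-load sketch: index every HTML file in the current
# directory, assigning ids doc1, doc2, ... (assumes the Solr example
# server from this page is running on localhost:8983).
i=0
for f in *.html; do
  i=$((i+1))
  curl "http://localhost:8983/solr/update/extract?literal.id=doc$i" -F "myfile=@$f"
done
# A single commit at the end makes all the documents searchable at once.
curl 'http://localhost:8983/solr/update/' -H 'Content-Type: text/xml' --data-binary '<commit/>'
```

Deferring the commit this way avoids paying the commit cost once per document.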
  
+ Now, you should be able to execute a query and find that document (open the following link
in your browser):
+ http://localhost:8983/solr/select?q=tutorial
  
- In a separate window, post a file:
+ You may notice that although you can search on any of the text in the sample document, you
may not be able to see that text when the document is retrieved.  This is simply because the
"content" field generated by Tika is mapped to the Solr field called "text" (which is indexed
but not stored) via the default map rule in {{{solrconfig.xml}}} that can be changed or overridden.
 For example, to store and see all metadata and content, execute the following:
+ {{{
+ curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&map.content=attr_content&commit=true'
-F "myfile=@tutorial.html"
+ }}}
+ And then query via http://localhost:8983/solr/select?q=attr_content:tutorial
  
+ // TODO: move this somewhere else to a more in-depth discussion of different ways to send
the data to Solr (prob with remoteStreaming discussion)
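As a sketch of one alternative to multipart POST: if remote streaming is enabled in solrconfig.xml (enableRemoteStreaming="true" on the requestParsers element), Solr can be told to read the file itself via the stream.file parameter.  The path below is hypothetical:

```shell
# Assumes enableRemoteStreaming="true" is set on <requestParsers> in
# solrconfig.xml; Solr then reads the file directly from the server's
# filesystem (hypothetical absolute path) instead of receiving a POST body.
curl 'http://localhost:8983/solr/update/extract?literal.id=doc3&stream.file=/path/to/site/tutorial.html&commit=true'
```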
-  *  curl http://localhost:8983/solr/update/extract?ext.idx.attr=true\&ext.def.fl=text
 -F "myfile=@tutorial.html" //Note, the trunk/site contains some nice example docs 
-   * hint: myfile=@tutorial.html needs a valid path (absolute or relative), e.g. "myfile=@../../site/tutorial.html"
if you are still in exampledocs dir.
-   * with recent svn, you may need to add a unique '''id''' param to curl (see [http://www.nabble.com/Missing-required-field:-id-Using-ExtractingRequestHandler-td22611039.html
nabble msg]):
-   * e.g. curl http://localhost:8983/solr/update/extract?ext.idx.attr=true\&ext.def.fl=text\&ext.literal.id=123
-F "myfile=@../../site/tutorial.html"
- 
- or
- 
   * curl http://localhost:8983/solr/update/extract?ext.idx.attr=true\&ext.def.fl=text
 --data-binary @tutorial.html  -H 'Content-type:text/html'  
        <!> NOTE: this literally streams the file contents, so Solr does not receive any information about the name of the file.
  
- or whatever other way you know how to do it.  Don't forget to COMMIT!
-  * e.g. curl "http://localhost:8983/solr/update/" -H "Content-Type: text/xml" --data-binary
'<commit waitFlush="false"/>'   --see [http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#example.source
LucidImagination note]
  
  If you are not working from the supplied example/solr directory you must copy all libraries from example/solr/libs into a libs directory within your own solr directory. The !ExtractingRequestHandler is not incorporated into the solr war file; you must install it separately.
+ 
+ = Input Parameters =
+ 
+  * ext.boost.<NAME> = Float -  Boost the field with the specified name.  The NAME
value is the name of the Solr field (not the Tika metadata name). 
+  * ext.capture = <Tika XHTML NAME> - Capture fields with the name separately for adding
to the Solr document.  This can be useful for grabbing chunks of the XHTML into a separate
field.  For instance, it could be used to grab paragraphs (<p>) and index them into
a separate field.  Note that content is also still captured into the overall string buffer.
+  * ext.def.fl = <NAME> - The name of the field to add the default content to.  See also ext.capture above.  This NAME is not mapped, but it can be boosted.
+  * ext.extract.only = true|false - Default is false.  If true, return the extracted content
from Tika without indexing the document.  This literally includes the extracted XHTML as a
<str> in the response.  See TikaExtractOnlyExampleOutput.
+  * ext.idx.attr = true|false - Index the Tika XHTML attributes into separate fields, named
after the attribute.  For example, when extracting from HTML, Tika can return the href values
of <a> tags as attributes of a tag name.  See the examples below.
+  * ext.ignore.und.fl = true|false - Default is false.  If true, ignore fields that are extracted
but are not in the Solr Schema.  Otherwise, an exception will be thrown for fields that are
not mapped.
+  * ext.literal.<NAME> = <VALUE> - Create a field on the document with field
name NAME and literal value VALUE, e.g. ext.literal.foo=bar.  May be multivalued if the Field
is multivalued.  Otherwise, the ERH will throw an exception.
+  * ext.map.<Tika Metadata Attribute> = Solr Field Name - Map a Tika metadata attribute to a Solr field name.  If no mapping is specified, the metadata attribute will be used as the field name.  If the field name doesn't exist in the schema, it can be ignored by setting the "ignore undeclared fields" (ext.ignore.und.fl) parameter described above.
+  * ext.metadata.prefix=<VALUE> - Prepend a String value to all Metadata, such that
it is easy to map new metadata fields to dynamic fields
+  * ext.resource.name=<File Name> - Optional.  The name of the file.  Tika can use
it as a hint for detecting mime type.
+  * ext.xpath = <XPath expression> - When extracting, only return Tika XHTML content
that satisfies the XPath expression.  See http://lucene.apache.org/tika/documentation.html
for details on the format of Tika XHTML.  See also TikaExtractOnlyExampleOutput.
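A sketch combining several of the ext.* parameters listed above; the field names and values here are illustrative, and it assumes the example server is running and the schema declares the named fields:

```shell
# Illustrative combination of the ext.* parameters above (field names and
# values are hypothetical): map Tika's "title" metadata to a Solr field,
# add a literal field, send remaining content to the "text" field, and
# silently drop any extracted fields the schema doesn't declare.
curl 'http://localhost:8983/solr/update/extract?ext.map.title=doc_title&ext.literal.category=docs&ext.def.fl=text&ext.ignore.und.fl=true' \
  -F "myfile=@tutorial.html"
```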
  
  = Configuration =
  
@@ -104, +118 @@

  EEE MMM d HH:mm:ss yyyy
  }}}
  
- = Input Parameters =
+ == MultiCore config ==
+  * For multi-core, specify {{{ sharedLib='lib' }}} in {{{ <solr /> }}} in example/solr/solr.xml
in order for Solr to find the jars in example/solr/lib
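For illustration, the sharedLib attribute mentioned above goes on the <solr> element of example/solr/solr.xml; a minimal sketch (core names are hypothetical):

```xml
<!-- Sketch of example/solr/solr.xml for multi-core; core names are
     hypothetical.  sharedLib="lib" makes Solr load the jars from
     example/solr/lib for all cores. -->
<solr persistent="true" sharedLib="lib">
  <cores adminPath="/admin/cores">
    <core name="core0" instanceDir="core0" />
    <core name="core1" instanceDir="core1" />
  </cores>
</solr>
```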
  
-  * ext.boost.<NAME> = Float -  Boost the field with the specified name.  The NAME
value is the name of the Solr field (not the Tika metadata name). 
-  * ext.capture = <Tika XHTML NAME> - Capture fields with the name separately for adding
to the Solr document.  This can be useful for grabbing chunks of the XHTML into a separate
field.  For instance, it could be used to grab paragraphs (<p>) and index them into
a separate field.  Note that content is also still captured into the overall string buffer.
-  * ext.def.fl = <NAME> - The name of the field to add the default content to.  See
also ext.capture below.  This NAME is not mapped, but it can be boosted.
-  * ext.extract.only = true|false - Default is false.  If true, return the extracted content
from Tika without indexing the document.  This literally includes the extracted XHTML as a
<str> in the response.  See TikaExtractOnlyExampleOutput.
-  * ext.idx.attr = true|false - Index the Tika XHTML attributes into separate fields, named
after the attribute.  For example, when extracting from HTML, Tika can return the href values
of <a> tags as attributes of a tag name.  See the examples below.
-  * ext.ignore.und.fl = true|false - Default is false.  If true, ignore fields that are extracted
but are not in the Solr Schema.  Otherwise, an exception will be thrown for fields that are
not mapped.
-  * ext.literal.<NAME> = <VALUE> - Create a field on the document with field
name NAME and literal value VALUE, e.g. ext.literal.foo=bar.  May be multivalued if the Field
is multivalued.  Otherwise, the ERH will throw an exception.
-  * ext.map.<Tika Metadata Attribute> = Solr Field Name - Map a Tika metadata attribute
to a Solr field name.  If no mapping is specified, the metadata attribute will be used as
the field name.  If the field name doesn't exist, it can be ignored by setting the "ignore
undeclared fields" (ext.ignore.und.fl) attribute described below
-  * ext.metadata.prefix=<VALUE> - Prepend a String value to all Metadata, such that
it is easy to map new metadata fields to dynamic fields
-  * ext.resource.name=<File Name> - Optional.  The name of the file.  Tika can use
it as a hint for detecting mime type.
-  * ext.xpath = <XPath expression> - When extracting, only return Tika XHTML content
that satisfies the XPath expression.  See http://lucene.apache.org/tika/documentation.html
for details on the format of Tika XHTML.  See also TikaExtractOnlyExampleOutput.
  
  = Metadata =
  
@@ -171, +175 @@

  See TikaExtractOnlyExampleOutput.
  
  
+ == Additional Resources ==
+  * [http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#example.source Lucid Imagination article]
+  * [http://lucene.apache.org/tika/formats.html Supported document formats via Tika]
- = Customizing =
- 
- While the current !ExtractingRequestHandler only allows for the use of the !SolrContentHandler
in creating new documents, it is relatively easy to implement your own extension that processes
the Tika extracted content differently and produces a different !SolrInputDocument.
- 
- To do this, implement your own instance of the !SolrContentHandlerFactory and override the
createFactory() method on the !ExtractingRequestHandler.
  
  = What's in a Name =
  
