lucene-solr-dev mailing list archives

From: Yonik Seeley <yo...@lucidimagination.com>
Subject: Re: [Solr Wiki] Update of "ExtractingRequestHandler" by YonikSeeley
Date: Tue, 14 Jul 2009 20:45:20 GMT
FYI, I'm only really done down to the "// TODO: move this somewhere else[...]"

I've removed a number of things that were complicated or misleading
and tried to improve the first example - a good OOTB experience with
this handler is esp important I think.  Let me know if you think I've
removed something I shouldn't have, or if anything will be confusing
to someone looking at it the first time.

I'll continue making changes today and tomorrow.

-Yonik
http://www.lucidimagination.com

On Tue, Jul 14, 2009 at 4:39 PM, Apache Wiki <wikidiffs@apache.org> wrote:
> Dear Wiki user,
>
> You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.
>
> The following page has been changed by YonikSeeley:
> http://wiki.apache.org/solr/ExtractingRequestHandler
>
> The comment on the change is:
> snapshot - updating to reflect committed code, simplifying
>
> ------------------------------------------------------------------------------
>
>  [[TableOfContents]]
>
> - Please see [https://issues.apache.org/jira/browse/SOLR-284 SOLR-284] for more information on the incorporation of this feature into Solr 1.4.
> -
>  = Introduction =
>
> - A common need of users is the ability to ingest binary and/or structured documents such as Office, PDF and other proprietary formats.  The [http://incubator.apache.org/tika/ Apache Tika] project provides a framework for wrapping many different file format parsers, such as PDFBox, POI and others.
> + <!> ["Solr1.4"]
>
> + A common need of users is the ability to ingest binary and/or structured documents such as Office, Word, PDF and other proprietary formats.  The [http://incubator.apache.org/tika/ Apache Tika] project provides a framework for wrapping many different file format parsers, such as PDFBox, POI and others.
> +
> - Solr's !ExtractingRequestHandler provides a wrapper around Tika to allow users to upload binary files to Solr and have Solr extract text from it and then index it.
> + Solr's !ExtractingRequestHandler uses Tika to allow users to upload binary files to Solr, have Solr extract text from them, and then index it.
>
>  = Concepts =
>
> @@ -18, +18 @@
>
>
>   * Tika will automatically attempt to determine the input document type (Word, PDF, etc.) and extract the content appropriately. If you want, you can explicitly specify a MIME type for Tika with the stream.type parameter.
>   * Tika does everything by producing an XHTML stream that it feeds to a SAX !ContentHandler (see the sketch after this list).
> -  * Solr then implements a !SolrContentHandler which reacts to Tika's SAX events and creates a !SolrInputDocument.  You can override the !SolrContentHandler.  See the section below on Customization.
> -  * Tika produces Metadata information according to things like !DublinCore and other specifications.  See the Tika javadocs on the Metadata class for what gets produced.  <!> TODO: Link to Tika Javadocs <!>  See also http://lucene.apache.org/tika/formats.html
> +  * Solr then reacts to Tika's SAX events and creates the fields to index.
> +  * Tika produces Metadata information such as Title, Subject, and Author, according to specifications like !DublinCore.  See http://lucene.apache.org/tika/formats.html for the file types supported.
> +  * All of the extracted text is added to the "content" field.
>   * We can map Tika's metadata fields to Solr fields.  We can boost these fields.
> -  * We can also pass in literals.
> +  * We can also pass in literals for field values.
> +  * We can apply an XPath expression to the Tika XHTML to restrict the content that is produced.
> -  * We can apply an XPath expression to the Tika XHTML by passing in the ext.xpath parameter (described below).  This restricts down the events that are given to the !SolrContentHandler.  It is still up to the !SolrContentHandler to process those events.
> -  * Field boosts are applied after name mapping
> -  * It is useful to keep in mind what a given operation is using for input when specifying parameters.  For instance, captured fields are specified to the !SolrContentHandler for capturing content in the Tika XHTML.  Thus, the names of the fields are those of the XHTML, not the mapped names.
> -  * A default field name is required for indexing, but not for extraction only.
> -  * The default field name and any literal values are not mapped.  They can be boosted.  See the examples.
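> +
> + For illustration, the XHTML stream Tika feeds to the SAX handler has roughly the shape sketched below.  This sample is hypothetical (the exact elements and metadata vary by file format and parser), not the actual output for any particular document:
> + {{{
> + <html xmlns="http://www.w3.org/1999/xhtml">
> + <head>
> +   <!-- metadata such as the document title surfaces here -->
> +   <title>Solr tutorial</title>
> + </head>
> + <body>
> +   <!-- extracted text arrives as ordinary XHTML elements -->
> +   <p>Welcome to the Solr tutorial...</p>
> + </body>
> + </html>
> + }}}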
> -
> -
> - == When To Use ==
> -
> - The !ExtractingRequestHandler can be used any time you have the need to index both the metadata and text of binary documents like Word, PDF, etc.  It doesn't, however, make sense to use it if you are only interested in indexing the metadata about documents, since it will be much faster to determine the metadata on the client side and then send that as a normal Solr document.  In fact, it might make sense for someone to write a piece for SolrJ that uses Tika on the client-side to construct Solr documents.
>
>  = Getting Started with the Solr Example =
> +  * Check out Solr trunk or get a 1.4 release or later.
> +  * If using a checkout, running "ant example" will build the necessary jars.
> + Now start the Solr example server:
> + {{{
> + cd example
> + java -jar start.jar
> + }}}
>
> -  * Check out Solr trunk or get a 1.4 release or later if it exists.
> -  * If using a check out, running "ant example" will build the necessary jars.
> -  * cd example
> -  * The example directory comes with all required libs, but the configuration files are not setup for the !ExtractingRequestHandler. Add the Configuration as defined below to the example's solrconfig.xml.
> -   *''recent versions of the solr code from svn, do contain a configuration section within example/solr/conf/solrconfig.xml but it needs uncommented.''
> -  * java -jar start.jar
> -  * For multi-core, specify {{{ sharedLib='lib' }}} in {{{ <solr /> }}} in example/solr/solr.xml in order for Solr to find the jars in example/solr/lib
> + In a separate window go to the {{{site/}}} directory (which contains some nice example docs) and send Solr a file via HTTP POST:
> + {{{
> + cd site
> + curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true' -F "myfile=@tutorial.html"
> + }}}
> +  * hint: myfile=@tutorial.html needs a valid path (absolute or relative), e.g. "myfile=@../../site/tutorial.html" if you are running curl from the exampledocs directory instead.
> +  * the {{{literal.id=doc1}}} param provides the necessary unique id for the document being indexed.
> +  * the {{{commit=true}}} param causes Solr to do a commit after indexing the document, making it immediately searchable.  For good performance when loading many documents, don't call commit until you are done (see the example after this list).
> +  * using "curl" or other command-line tools to post documents to Solr is nice for testing, but it is not the recommended update method for best performance.
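> +
> + For example, if you load many documents without {{{commit=true}}}, a single commit can be issued afterwards with a plain update request:
> + {{{
> + # commit all pending documents in one call
> + curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' --data-binary '<commit/>'
> + }}}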
>
> + Now, you should be able to execute a query and find that document (open the following link in your browser):
> + http://localhost:8983/solr/select?q=tutorial
>
> - In a separate window, post a file:
> + You may notice that although you can search on any of the text in the sample document, you may not be able to see that text when the document is retrieved.  This is simply because the "content" field generated by Tika is mapped to the Solr field called "text" (which is indexed but not stored) via the default map rule in {{{solrconfig.xml}}} that can be changed or overridden.  For example, to store and see all metadata and content, execute the following:
> + {{{
> + curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&map.content=attr_content&commit=true' -F "myfile=@tutorial.html"
> + }}}
> + And then query via http://localhost:8983/solr/select?q=attr_content:tutorial
>
> + // TODO: move this somewhere else to a more in-depth discussion of different ways to send the data to Solr (prob with remoteStreaming discussion)
> -  *  curl http://localhost:8983/solr/update/extract?ext.idx.attr=true\&ext.def.fl=text -F "myfile=@tutorial.html" //Note, the trunk/site contains some nice example docs
> -   * hint: myfile=@tutorial.html needs a valid path (absolute or relative), e.g. "myfile=@../../site/tutorial.html" if you are still in exampledocs dir.
> -   * with recent svn, you may need to add a unique '''id''' param to curl (see [http://www.nabble.com/Missing-required-field:-id-Using-ExtractingRequestHandler-td22611039.html nabble msg]):
> -   * e.g. curl http://localhost:8983/solr/update/extract?ext.idx.attr=true\&ext.def.fl=text\&ext.literal.id=123 -F "myfile=@../../site/tutorial.html"
> -
> - or
> -
>   * curl http://localhost:8983/solr/update/extract?ext.idx.attr=true\&ext.def.fl=text --data-binary @tutorial.html -H 'Content-type:text/html'
>         <!> NOTE: this literally streams the file contents, so Solr receives no information about the name of the file.
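> +
> + Another way to send data, assuming remote streaming has been enabled via {{{enableRemoteStreaming="true"}}} on {{{<requestParsers>}}} in solrconfig.xml, is to have Solr fetch the document itself with the stream.url parameter (the URL below is only a placeholder):
> + {{{
> + # Solr pulls the file from the given URL instead of receiving it in the POST body
> + curl 'http://localhost:8983/solr/update/extract?literal.id=doc2&commit=true&stream.url=http://example.com/tutorial.html'
> + }}}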
>
> - or whatever other way you know how to do it.  Don't forget to COMMIT!
> -  * e.g. curl "http://localhost:8983/solr/update/" -H "Content-Type: text/xml" --data-binary '<commit waitFlush="false"/>'   --see [http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#example.source LucidImagination note]
>
>  If you are not working from the supplied example/solr directory, you must copy all libraries from example/solr/lib into a lib directory within your own solr directory.  The !ExtractingRequestHandler is not incorporated into the solr war file; you must install it separately.
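> + For instance, something along these lines (the destination path is a placeholder for your own Solr home):
> + {{{
> + # copy the extraction jars into your Solr home's lib directory
> + cp example/solr/lib/*.jar /path/to/your/solr/lib/
> + }}}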
> +
> + = Input Parameters =
> +
> +  * ext.boost.<NAME> = Float - Boost the field with the specified name.  The NAME value is the name of the Solr field (not the Tika metadata name).
> +  * ext.capture = <Tika XHTML NAME> - Capture fields with the given name separately for adding to the Solr document.  This can be useful for grabbing chunks of the XHTML into a separate field.  For instance, it could be used to grab paragraphs (<p>) and index them into a separate field.  Note that content is also still captured into the overall string buffer.
> +  * ext.def.fl = <NAME> - The name of the field to add the default content to.  See also ext.capture above.  This NAME is not mapped, but it can be boosted.
> +  * ext.extract.only = true|false - Default is false.  If true, return the extracted content from Tika without indexing the document.  This literally includes the extracted XHTML as a <str> in the response.  See TikaExtractOnlyExampleOutput.
> +  * ext.idx.attr = true|false - Index the Tika XHTML attributes into separate fields, named after the attribute.  For example, when extracting from HTML, Tika can return the href values of <a> tags as attributes of a tag name.  See the examples below.
> +  * ext.ignore.und.fl = true|false - Default is false.  If true, ignore fields that are extracted but are not in the Solr schema.  Otherwise, an exception will be thrown for fields that are not mapped.
> +  * ext.literal.<NAME> = <VALUE> - Create a field on the document with field name NAME and literal value VALUE, e.g. ext.literal.foo=bar.  May be multivalued if the Field is multivalued; otherwise, the ERH will throw an exception.
> +  * ext.map.<Tika Metadata Attribute> = Solr Field Name - Map a Tika metadata attribute to a Solr field name.  If no mapping is specified, the metadata attribute will be used as the field name.  If the field name doesn't exist in the schema, it can be ignored by setting the "ignore undeclared fields" (ext.ignore.und.fl) parameter described above.
> +  * ext.metadata.prefix = <VALUE> - Prepend a String value to all Metadata field names, making it easy to map new metadata fields to dynamic fields.
> +  * ext.resource.name = <File Name> - Optional.  The name of the file.  Tika can use it as a hint for detecting MIME type.
> +  * ext.xpath = <XPath expression> - When extracting, only return Tika XHTML content that satisfies the XPath expression.  See http://lucene.apache.org/tika/documentation.html for details on the format of Tika XHTML.  See also TikaExtractOnlyExampleOutput.
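> +
> + As a combined illustration (the field names and boost value here are hypothetical), several of these parameters can be used in one request:
> + {{{
> + # map Tika's "title" metadata to doc_title, boost that field, add a literal id, and send default content to "text"
> + curl 'http://localhost:8983/solr/update/extract?ext.literal.id=doc3&ext.map.title=doc_title&ext.boost.doc_title=2.0&ext.def.fl=text' -F "myfile=@tutorial.html"
> + }}}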
>
>  = Configuration =
>
> @@ -104, +118 @@
>
>  EEE MMM d HH:mm:ss yyyy
>  }}}
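> +
> + For reference, a minimal handler registration in solrconfig.xml looks roughly like the sketch below; the default shown is illustrative, not required:
> + {{{
> + <requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
> +   <lst name="defaults">
> +     <!-- illustrative default: send default content to the "text" field -->
> +     <str name="ext.def.fl">text</str>
> +   </lst>
> + </requestHandler>
> + }}}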
>
> - = Input Parameters =
> + == MultiCore config ==
> +  * For multi-core, specify {{{ sharedLib='lib' }}} in {{{ <solr /> }}} in example/solr/solr.xml in order for Solr to find the jars in example/solr/lib (see the sketch below).
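> + A minimal solr.xml along these lines would do it (core names and instance dirs are illustrative):
> + {{{
> + <solr persistent="true" sharedLib="lib">
> +   <cores adminPath="/admin/cores">
> +     <!-- each core shares the jars in example/solr/lib via sharedLib -->
> +     <core name="core0" instanceDir="core0" />
> +   </cores>
> + </solr>
> + }}}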
>
> -  * ext.boost.<NAME> = Float -  Boost the field with the specified name.  The NAME value is the name of the Solr field (not the Tika metadata name).
> -  * ext.capture = <Tika XHTML NAME> - Capture fields with the name separately for adding to the Solr document.  This can be useful for grabbing chunks of the XHTML into a separate field.  For instance, it could be used to grab paragraphs (<p>) and index them into a separate field.  Note that content is also still captured into the overall string buffer.
> -  * ext.def.fl = <NAME> - The name of the field to add the default content to.  See also ext.capture below.  This NAME is not mapped, but it can be boosted.
> -  * ext.extract.only = true|false - Default is false.  If true, return the extracted content from Tika without indexing the document.  This literally includes the extracted XHTML as a <str> in the response.  See TikaExtractOnlyExampleOutput.
> -  * ext.idx.attr = true|false - Index the Tika XHTML attributes into separate fields, named after the attribute.  For example, when extracting from HTML, Tika can return the href values of <a> tags as attributes of a tag name.  See the examples below.
> -  * ext.ignore.und.fl = true|false - Default is false.  If true, ignore fields that are extracted but are not in the Solr Schema.  Otherwise, an exception will be thrown for fields that are not mapped.
> -  * ext.literal.<NAME> = <VALUE> - Create a field on the document with field name NAME and literal value VALUE, e.g. ext.literal.foo=bar.  May be multivalued if the Field is multivalued.  Otherwise, the ERH will throw an exception.
> -  * ext.map.<Tika Metadata Attribute> = Solr Field Name - Map a Tika metadata attribute to a Solr field name.  If no mapping is specified, the metadata attribute will be used as the field name.  If the field name doesn't exist, it can be ignored by setting the "ignore undeclared fields" (ext.ignore.und.fl) attribute described below
> -  * ext.metadata.prefix=<VALUE> - Prepend a String value to all Metadata, such that it is easy to map new metadata fields to dynamic fields
> -  * ext.resource.name=<File Name> - Optional.  The name of the file.  Tika can use it as a hint for detecting mime type.
> -  * ext.xpath = <XPath expression> - When extracting, only return Tika XHTML content that satisfies the XPath expression.  See http://lucene.apache.org/tika/documentation.html for details on the format of Tika XHTML.  See also TikaExtractOnlyExampleOutput.
>
>  = Metadata =
>
> @@ -171, +175 @@
>
>  See TikaExtractOnlyExampleOutput.
>
>
> + == Additional Resources ==
> + * [http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#example.source Lucid Imagination article]
> + * [http://lucene.apache.org/tika/formats.html Supported document formats via Tika]
> - = Customizing =
> -
> - While the current !ExtractingRequestHandler only allows for the use of the !SolrContentHandler in creating new documents, it is relatively easy to implement your own extension that processes the Tika extracted content differently and produces a different !SolrInputDocument.
> -
> - To do this, implement your own instance of the !SolrContentHandlerFactory and override the createFactory() method on the !ExtractingRequestHandler.
>
>  = What's in a Name =
>
>
