lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "ExtractingUpdateProcessor" by JanHoydahl
Date Thu, 06 Oct 2011 00:28:14 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "ExtractingUpdateProcessor" page has been changed by JanHoydahl:
http://wiki.apache.org/solr/ExtractingUpdateProcessor

Comment:
Initial version

New page:
= ExtractingUpdateProcessor =

<!> Currently under development in [[https://issues.apache.org/jira/browse/SOLR-1763|SOLR-1763]]

<<TableOfContents(3)>>

= Introduction =

The !ExtractingUpdateProcessor is an Update Processor that extracts text from rich
documents such as PDFs and MS Office files. It is based on [[http://tika.apache.org/|Apache Tika]],
which supports [[http://tika.apache.org/0.10/formats.html|more than 30 document formats]].
The processor ships in the {{{solr-extraction}}} contrib module, bundled together
with ExtractingRequestHandler.

!ExtractingUpdateProcessor does the same job as !ExtractingRequestHandler, namely extracting
text from rich documents. However, running it as an UpdateProcessor has several benefits over the
RequestHandler approach:
 * Extract text from multiple binary attachments in the same Solr document
 * Better control of which fields to write the output and metadata to
 * Use with any RequestHandler, such as XML, CSV, Binary (SolrJ), DIH etc. (since all of these
support the UpdateChain)
 * Do more complex integrations, like an UpdateChain which reads a file reference from the
document, then fetches the document from external storage before extraction

= Configuration =
The UpdateRequestProcessor is configured in solrconfig.xml and supports many parameters.
All parameters listed may also be overridden on the update request itself. A minimal configuration
reads input from a binary field named {{{stream_content}}} and the file name from the field
{{{stream_name}}}, and writes the extracted data to the fields {{{title}}} and {{{body}}}:
{{{
<processor class="org.apache.solr.update.processor.ExtractingUpdateProcessorFactory" />
}}}
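
For context, an update processor only runs as part of an update chain. The following is a minimal sketch of such a chain, assuming the usual {{{<updateRequestProcessorChain>}}} layout in solrconfig.xml. The chain name {{{extract}}} is hypothetical, and the extracting processor itself is still under development in SOLR-1763, so details may differ:
{{{
<!-- Sketch only: the chain name "extract" is an assumption made for illustration -->
<updateRequestProcessorChain name="extract">
  <!-- Extract text from the binary input field before the document is indexed -->
  <processor class="org.apache.solr.update.processor.ExtractingUpdateProcessorFactory" />
  <!-- Standard processors that log the update and then execute it -->
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
}}}
An update request would then select the chain by name, typically with the {{{update.chain}}} request parameter (older Solr releases used {{{update.processor}}}).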


'''NOTE:''' The processor supports the {{{defaults/appends/invariants}}} concept for its
config. However, it is also possible to skip this level and configure the parameters directly
underneath the {{{<processor>}}} tag.
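
As an illustration of the nested style, the sketch below assumes the processor accepts the same {{{<lst name="defaults">}}} and {{{<lst name="invariants">}}} nesting that request handlers use. The parameter names are taken from the example further down this page, and the exact behaviour may change while SOLR-1763 is in progress:
{{{
<processor class="org.apache.solr.update.processor.ExtractingUpdateProcessorFactory">
  <!-- Assumed: defaults apply unless the update request supplies its own value -->
  <lst name="defaults">
    <str name="out.body.field">description_en</str>
  </lst>
  <!-- Assumed: invariants cannot be overridden by the update request -->
  <lst name="invariants">
    <str name="in.content.field">binary_content</str>
  </lst>
</processor>
}}}
Under these assumptions, a request could redirect the extracted body to another field but could not change which field the binary content is read from.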

Below is a list of the configuration parameters and their meanings:

<!> TBD
== a ==
Bla bla

'''Value:''' true/false

'''Default:''' true



= Examples =

== Override input and output fields ==

{{{
<processor class="org.apache.solr.update.processor.ExtractingUpdateProcessorFactory" >
  <str name="in.content.field">binary_content</str>
  <str name="in.filename.field">filename</str>
  <str name="out.title.field">title_en</str>
  <str name="out.body.field">description_en</str>
  <str name="out.mimetype.field">mimetype</str>
</processor>
}}}

= Resources =

 * [[http://tika.apache.org/|Apache Tika]]
 * [[https://issues.apache.org/jira/browse/SOLR-1763|SOLR-1763]]
 * ExtractingRequestHandler
