lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Trivial Update of "ExtractingRequestHandler" by iorixxx
Date Wed, 15 Aug 2012 11:33:56 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "ExtractingRequestHandler" page has been changed by iorixxx:
http://wiki.apache.org/solr/ExtractingRequestHandler?action=diff&rev1=80&rev2=81

Comment:
broken Tika documentation link corrected

   * resource.name=<File Name> - The optional name of the file.  Tika can use it as
a hint for detecting mime type.
   * capture=<Tika XHTML NAME> - Capture XHTML elements with the name separately for
adding to the Solr document.  This can be useful for grabbing chunks of the XHTML into a separate
field.  For instance, it could be used to grab paragraphs (<p>) and index them into
a separate field.  Note that content is also still captured into the overall "content" field.
   * captureAttr=true|false - Index attributes of the Tika XHTML elements into separate fields,
named after the element.  For example, when extracting from HTML, Tika can return the href
attributes in <a> tags as fields named "a". See the examples below.
-  * xpath=<XPath expression> - When extracting, only return Tika XHTML content that
satisfies the XPath expression.  See http://lucene.apache.org/tika/documentation.html for
details on the format of Tika XHTML.  See also TikaExtractOnlyExampleOutput.
+  * xpath=<XPath expression> - When extracting, only return Tika XHTML content that
satisfies the XPath expression.  See http://tika.apache.org/1.2/parser.html for details on
the format of Tika XHTML.  See also TikaExtractOnlyExampleOutput.
   * lowernames=true|false - Map all field names to lowercase with underscores.  For example,
Content-Type would be mapped to content_type.
   * literalsOverride=true|false - <!> [[Solr4.0]] When true, literal field values will
override other values with the same field name, such as metadata and content. If false, literal
field values will be appended to any data extracted by Tika, and the resulting field needs
to be multivalued. Default: true
   * resource.password=<password> - <!> [[Solr4.0]] The optional password for
a password-protected PDF or OOXML file. File format support depends on Tika.
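
As a rough illustration of how the parameters listed above might be combined into a single request URL, here is a minimal Python sketch. The base URL (host, port, and handler path), the file name, the XPath expression, and the helper name build_extract_url are all assumptions made for this example; none of them come from the wiki page. The sketch only builds the URL and does not send a request.

```python
from urllib.parse import urlencode

# Assumed endpoint, for illustration only; adjust to your own Solr setup.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"


def build_extract_url(base_url, params):
    """Return base_url plus the given parameters as an encoded query string.

    Hypothetical helper: Python booleans are rendered as the lowercase
    true/false values that the parameter list uses.
    """
    pairs = []
    for name, value in params.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        pairs.append((name, value))
    return base_url + "?" + urlencode(pairs)


# Example values are made up for this sketch.
example_params = {
    "resource.name": "report.pdf",      # hint for mime type detection
    "capture": "p",                     # capture <p> elements separately
    "captureAttr": True,                # index element attributes as fields
    "xpath": "/xhtml:html/xhtml:body",  # illustrative XPath expression
    "lowernames": True,                 # e.g. Content-Type -> content_type
    "literalsOverride": False,          # append literals instead of overriding
}

url = build_extract_url(SOLR_EXTRACT_URL, example_params)
print(url)
```

Note that urlencode percent-encodes characters such as "/" and ":" in the XPath value, so the expression survives as a single query parameter.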
