lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "ExtractingRequestHandler" by JanHoydahl
Date Fri, 22 Jun 2012 12:29:27 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "ExtractingRequestHandler" page has been changed by JanHoydahl:
http://wiki.apache.org/solr/ExtractingRequestHandler?action=diff&rev1=75&rev2=76

Comment:
literalsOverride

   * captureAttr=true|false - Index attributes of the Tika XHTML elements into separate fields,
named after the element.  For example, when extracting from HTML, Tika can return the href
attributes in <a> tags as fields named "a". See the examples below.
   * xpath=<XPath expression> - When extracting, only return Tika XHTML content that
satisfies the XPath expression.  See http://lucene.apache.org/tika/documentation.html for
details on the format of Tika XHTML.  See also TikaExtractOnlyExampleOutput.
   * lowernames=true|false - Map all field names to lowercase with underscores.  For example,
Content-Type would be mapped to content_type.
-  * literalsOverride=true|false - <!> [[Solr4.0]] When true, literal field values will
override other values with same field name, such as metadata and content. Default: true
+  * literalsOverride=true|false - <!> [[Solr4.0]] When true, literal field values
override other values with the same field name, such as metadata and content. If false, literal
field values are appended to any data extracted by Tika, and the resulting field must
be multivalued. Default: true
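
The literalsOverride behavior described above can be modeled with a short Python sketch. This is only an illustration of the documented semantics, not Solr's implementation; the function name merge_literals and the sample field values are hypothetical.

```python
def merge_literals(extracted, literals, literals_override=True):
    """Combine Tika-extracted values with literal values, per field.

    Illustrative model only: both inputs map a field name to a list of values.
    """
    merged = {name: list(values) for name, values in extracted.items()}
    for name, values in literals.items():
        if literals_override or name not in merged:
            # The literal replaces whatever was extracted for this field.
            merged[name] = list(values)
        else:
            # The literal is appended after the extracted values,
            # so the target field has to be multivalued.
            merged[name] = merged[name] + list(values)
    return merged


# Hypothetical sample data.
extracted = {"title": ["Title from Tika"], "content": ["body text"]}
literals = {"title": ["Title from literal"]}

overridden = merge_literals(extracted, literals, literals_override=True)
appended = merge_literals(extracted, literals, literals_override=False)
```

With literals_override=True the title holds only the literal value; with False it holds both values, extracted first.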
  
  If extractOnly is true, additional input parameters:
  
   * extractFormat=xml|text - Default is xml.  Controls the serialization format of the extract
content.  xml format is actually XHTML, like passing the -x command to the tika command line
application, while text is like the -t command.
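
As a rough sketch, an extractOnly request URL could be assembled as follows. The base URL and the /update/extract handler path are assumptions about a local setup and do not come from this page; the helper name extract_only_url is hypothetical. Only the extractOnly and extractFormat parameters are taken from the text above.

```python
from urllib.parse import urlencode

# Assumed placeholder: adjust host, port and handler path to your own setup.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"


def extract_only_url(extract_format="xml"):
    """Build a URL that asks the handler to extract without indexing."""
    if extract_format not in ("xml", "text"):
        raise ValueError("extractFormat must be 'xml' or 'text'")
    params = {"extractOnly": "true", "extractFormat": extract_format}
    return SOLR_EXTRACT_URL + "?" + urlencode(params)


text_url = extract_only_url("text")
```

The document to extract would still have to be sent as the request body; that part is left out here.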
  
  == Order of field operations ==
-  1. fields are generated by Tika or passed in as literals via {{{literal.fieldname=value}}}
+  1. fields are generated by Tika or passed in as literals via {{{literal.fieldname=value}}}.
<!> Before Solr4.0, or if literalsOverride=false, literals are appended as additional values
to the Tika-generated field.
   1. if lowernames==true, fields are mapped to lower case
   1. mapping rules {{{fmap.source=target}}} are applied
   1. if {{{uprefix}}} is specified, any unknown field names are prefixed with that value,
else if {{{defaultField}}} is specified, unknown fields are copied to that.
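
The four steps above can be sketched in Python as a small model. It is a simplified illustration under stated assumptions, not the handler's actual code: the function name apply_field_operations, the known_fields set (standing in for the schema) and the sample data are hypothetical, and the lowernames rule is approximated as "lowercase, then replace every non-alphanumeric character with an underscore".

```python
import re


def apply_field_operations(tika_fields, literals, known_fields, *,
                           literals_override=True, lowernames=False,
                           fmap=None, uprefix=None, default_field=None):
    """Model the documented order of field operations (illustrative only)."""
    fmap = fmap or {}

    # Step 1: start from Tika fields, then add the literals.
    fields = {name: list(values) for name, values in tika_fields.items()}
    for name, values in literals.items():
        if literals_override or name not in fields:
            fields[name] = list(values)
        else:
            fields[name].extend(values)

    result = {}
    for name, values in fields.items():
        # Step 2: lowernames (assumed rule, see the note above).
        if lowernames:
            name = re.sub(r"[^a-z0-9]", "_", name.lower())
        # Step 3: fmap.source=target mappings.
        name = fmap.get(name, name)
        # Step 4: unknown fields get uprefix, else go to defaultField.
        if name not in known_fields:
            if uprefix is not None:
                name = uprefix + name
            elif default_field is not None:
                name = default_field
        result.setdefault(name, []).extend(values)
    return result


# Hypothetical sample data.
result = apply_field_operations(
    {"Content-Type": ["application/pdf"], "title": ["From Tika"], "Author": ["Jane"]},
    {"id": ["doc1"]},
    known_fields={"id", "title", "mime"},
    lowernames=True,
    fmap={"content_type": "mime"},
    uprefix="attr_",
)
```

Here Content-Type becomes content_type, is mapped to mime, and is kept as a known field, while Author becomes author and, being unknown, ends up as attr_author.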
