lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "ExtractingRequestHandler" by YonikSeeley
Date Fri, 17 Jul 2009 16:56:03 GMT
The following page has been changed by YonikSeeley:
http://wiki.apache.org/solr/ExtractingRequestHandler

The comment on the change is:
finished describing input params

------------------------------------------------------------------------------
  
  = Input Parameters =
   * map.<source_field>=<target_field> - Maps (moves) one field name to another.
 Example: {{{map.content=text}}} will cause the content field normally generated by Tika to
be moved to the "text" field.
-  * boost.<fieldname>=<float>-  Boost the specified field.
+  * boost.<fieldname>=<float> -  Boost the specified field.
   * literal.<fieldname>=<value> - Create a field with the specified value. May
be multivalued if the Field is multivalued.
-  * uprefix=<prefix> - Prefix all fields that are not defined in the schema with the
given prefix.  This is very useful when combined with dynamic field definitions.  For example
{{{uprefix=ignored_}}} would effectively ignore all unknown metadata fields generated by Tika
given the example schema contains {{{<dynamicField name="ignored_*" type="ignored"/>}}}
+  * uprefix=<prefix> - Prefix all fields that are not defined in the schema with the
given prefix.  This is very useful when combined with dynamic field definitions.  Example:
{{{uprefix=ignored_}}} would effectively ignore all unknown fields generated by Tika, given
that the example schema contains {{{<dynamicField name="ignored_*" type="ignored"/>}}}
+  * extractOnly=true|false - Default is false.  If true, return the extracted content from
Tika without indexing the document.  This literally includes the extracted XHTML as a string
in the response.  When viewing manually, it may be useful to use a response format other than
XML to aid in viewing the embedded XHTML tags. See TikaExtractOnlyExampleOutput.
+  * resource.name=<File Name> - The optional name of the file.  Tika can use it as
a hint for detecting mime type.
+  * capture=<Tika XHTML NAME> - Capture XHTML elements with the name separately for
adding to the Solr document.  This can be useful for grabbing chunks of the XHTML into a separate
field.  For instance, it could be used to grab paragraphs (<p>) and index them into
a separate field.  Note that content is also still captured into the overall "content" field.
+  * captureAttr=true|false - Index attributes of the Tika XHTML elements into separate fields,
named after the element.  For example, when extracting from HTML, Tika can return the href
attributes in <a> tags as fields named "a". See the examples below.
+  * xpath=<XPath expression> - When extracting, only return Tika XHTML content that
satisfies the XPath expression.  See http://lucene.apache.org/tika/documentation.html for
details on the format of Tika XHTML.  See also TikaExtractOnlyExampleOutput.
+  * lowernames=true|false - Map all field names to lowercase with underscores.  For example,
Content-Type would be mapped to content_type.
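
As a rough illustration of how the parameters listed above combine into one request, the following Python sketch builds a query string. The host, port and {{{/update/extract}}} path, the {{{id}}} field and the file name are assumptions made for the example and are not stated on this page; adjust them to the local configuration.

```python
from urllib.parse import urlencode

# Hypothetical values: "id" as a schema field and "report.pdf" as the
# uploaded file are placeholders, not taken from the wiki page.
params = [
    ("literal.id", "doc1"),           # literal.<fieldname>=<value>
    ("map.content", "text"),          # move Tika's "content" field to "text"
    ("uprefix", "ignored_"),          # prefix fields unknown to the schema
    ("lowernames", "true"),           # Content-Type -> content_type
    ("resource.name", "report.pdf"),  # mime-type hint for Tika
]

query = urlencode(params)

# Assumed endpoint; the handler path depends on solrconfig.xml.
url = "http://localhost:8983/solr/update/extract?" + query
print(url)
```

The document itself would still have to be sent as the request body; the sketch only covers the parameter string.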
  
- WORK IN PROGRESS
+ == Order of field operations ==
+  1. fields are generated by Tika or passed in as literals via {{{literal.fieldname=value}}}
+  1. if lowernames=true, field names are mapped to lowercase with underscores
+  1. mapping rules {{{map.source=target}}} are applied
+  1. unknown field names are prefixed with the value of {{{uprefix}}}
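
The four steps above can be sketched in Python as follows. This is a simplified model and not the handler's actual code: the function name, its arguments and the plain set used to stand in for the schema are hypothetical, and replacing every non-alphanumeric character with an underscore is an assumption based on the Content-Type example.

```python
import re


def apply_field_operations(tika_fields, literals, lowernames,
                           mappings, uprefix, schema_fields):
    """Apply the documented order of field operations to field names."""
    # Step 1: collect fields generated by Tika and those passed as literals.
    fields = {}
    for source in (tika_fields, literals):
        for name, value in source.items():
            fields.setdefault(name, []).append(value)

    # Step 2: optionally map names to lowercase with underscores
    # (assumed rule: any character outside a-z and 0-9 becomes "_").
    if lowernames:
        lowered = {}
        for name, values in fields.items():
            key = re.sub(r"[^a-z0-9]", "_", name.lower())
            lowered.setdefault(key, []).extend(values)
        fields = lowered

    # Step 3: apply map.<source>=<target> rules.
    mapped = {}
    for name, values in fields.items():
        mapped.setdefault(mappings.get(name, name), []).extend(values)

    # Step 4: prefix names that the (simplified) schema does not know.
    result = {}
    for name, values in mapped.items():
        if uprefix and name not in schema_fields:
            name = uprefix + name
        result.setdefault(name, []).extend(values)
    return result


result = apply_field_operations(
    tika_fields={"Content-Type": "application/pdf",
                 "content": "Hello",
                 "Author": "Jane"},
    literals={"id": "doc1"},
    lowernames=True,
    mappings={"content": "text"},
    uprefix="ignored_",
    schema_fields={"id", "text", "content_type"},
)
print(result)
```

Because lowercasing runs before mapping, a {{{map.}}} rule in this model has to name the lowercased source field.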
  
-  * ext.capture = <Tika XHTML NAME> - Capture fields with the name separately for adding
to the Solr document.  This can be useful for grabbing chunks of the XHTML into a separate
field.  For instance, it could be used to grab paragraphs (<p>) and index them into
a separate field.  Note that content is also still captured into the overall string buffer.
-  * ext.def.fl = <NAME> - The name of the field to add the default content to.  See
also ext.capture below.  This NAME is not mapped, but it can be boosted.
-  * ext.extract.only = true|false - Default is false.  If true, return the extracted content
from Tika without indexing the document.  This literally includes the extracted XHTML as a
<str> in the response.  See TikaExtractOnlyExampleOutput.
-  * ext.idx.attr = true|false - Index the Tika XHTML attributes into separate fields, named
after the attribute.  For example, when extracting from HTML, Tika can return the href values
of <a> tags as attributes of a tag name.  See the examples below.
-  * ext.ignore.und.fl = true|false - Default is false.  If true, ignore fields that are extracted
but are not in the Solr Schema.  Otherwise, an exception will be thrown for fields that are
not mapped.
-  * ext.literal.<NAME> = <VALUE> - Create a field on the document with field
name NAME and literal value VALUE, e.g. ext.literal.foo=bar.  May be multivalued if the Field
is multivalued.  Otherwise, the ERH will throw an exception.
  
+ -------------------------- UNDER CONSTRUCTION BELOW THIS POINT -------------------------
-  * ext.metadata.prefix=<VALUE> - Prepend a String value to all Metadata, such that
it is easy to map new metadata fields to dynamic fields
-  * ext.resource.name=<File Name> - Optional.  The name of the file.  Tika can use
it as a hint for detecting mime type.
-  * ext.xpath = <XPath expression> - When extracting, only return Tika XHTML content
that satisfies the XPath expression.  See http://lucene.apache.org/tika/documentation.html
for details on the format of Tika XHTML.  See also TikaExtractOnlyExampleOutput.
  
  = Configuration =
  
