lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "ExtractingRequestHandler" by YonikSeeley
Date Fri, 17 Jul 2009 16:12:20 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The following page has been changed by YonikSeeley:
http://wiki.apache.org/solr/ExtractingRequestHandler

The comment on the change is:
updating parameters - work in progress

------------------------------------------------------------------------------
         <!> NOTE: this streams the raw contents of the file, so Solr receives no information
about the name of the file.
  
  
- If you are not working from the supplied example/solr directory you must copy all libraries
from example/solr/libs into a libs directory within your own solr directory. The !ExtractingRequestHandler
is not incorporated into the solr war file, you have to install it separately.
+ = Input Parameters =
+  * map.<source_field>=<target_field> - Maps (moves) one field name to another.
 Example: {{{map.content=text}}} will cause the content field normally generated by Tika to
be moved to the "text" field.
+  * boost.<fieldname>=<float> - Boosts the specified field.
+  * literal.<fieldname>=<value> - Creates a field with the specified value. May
be multivalued if the field is multivalued.
+  * uprefix=<prefix> - Prefixes all fields that are not defined in the schema with the
given prefix.  This is very useful when combined with dynamic field definitions.  For example,
{{{uprefix=ignored_}}} would effectively ignore all unknown metadata fields generated by Tika,
given that the example schema contains {{{<dynamicField name="ignored_*" type="ignored"/>}}}.
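
As a rough illustration of how the parameters above combine on a single request, the following Python sketch only builds a query string; it does not contact Solr. The host, port, and handler path in BASE_URL, the {{{id}}} field used with literal., and the boost value are assumptions made for the example and are not taken from this page.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: host, port, and handler path are assumptions.
BASE_URL = "http://localhost:8983/solr/update/extract"


def build_extract_url(doc_id, base_url=BASE_URL):
    """Build a request URL that uses the new-style input parameters."""
    params = [
        ("literal.id", doc_id),    # literal.<fieldname>=<value>; an "id" field is assumed
        ("map.content", "text"),   # move Tika's "content" field to "text"
        ("uprefix", "ignored_"),   # prefix fields that are not in the schema
        ("boost.text", "2.0"),     # boost.<fieldname>=<float>; the value is arbitrary
    ]
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_extract_url("doc1"))
```

The file itself would still have to be sent as the request body or as a content stream; that part is left out here.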
  
- = Input Parameters =
+ WORK IN PROGRESS
  
-  * ext.boost.<NAME> = Float -  Boost the field with the specified name.  The NAME
value is the name of the Solr field (not the Tika metadata name). 
   * ext.capture = <Tika XHTML NAME> - Capture fields with the name separately for adding
to the Solr document.  This can be useful for grabbing chunks of the XHTML into a separate
field.  For instance, it could be used to grab paragraphs (<p>) and index them into
a separate field.  Note that content is also still captured into the overall string buffer.
   * ext.def.fl = <NAME> - The name of the field to add the default content to.  See
also ext.capture above.  This NAME is not mapped, but it can be boosted.
   * ext.extract.only = true|false - Default is false.  If true, return the extracted content
from Tika without indexing the document.  This literally includes the extracted XHTML as a
<str> in the response.  See TikaExtractOnlyExampleOutput.
   * ext.idx.attr = true|false - Index the Tika XHTML attributes into separate fields, named
after the attribute.  For example, when extracting from HTML, Tika can return the href values
of <a> tags as attributes of a tag name.  See the examples below.
   * ext.ignore.und.fl = true|false - Default is false.  If true, ignore fields that are extracted
but are not in the Solr Schema.  Otherwise, an exception will be thrown for fields that are
not mapped.
   * ext.literal.<NAME> = <VALUE> - Create a field on the document with field
name NAME and literal value VALUE, e.g. ext.literal.foo=bar.  May be multivalued if the Field
is multivalued.  Otherwise, the ERH will throw an exception.
-  * ext.map.<Tika Metadata Attribute> = Solr Field Name - Map a Tika metadata attribute
to a Solr field name.  If no mapping is specified, the metadata attribute will be used as
the field name.  If the field name doesn't exist, it can be ignored by setting the "ignore
undeclared fields" (ext.ignore.und.fl) attribute described below
+ 
   * ext.metadata.prefix=<VALUE> - Prepend a string value to all metadata field names,
making it easy to map new metadata fields to dynamic fields.
   * ext.resource.name=<File Name> - Optional.  The name of the file.  Tika can use
it as a hint for detecting mime type.
   * ext.xpath = <XPath expression> - When extracting, only return Tika XHTML content
that satisfies the XPath expression.  See http://lucene.apache.org/tika/documentation.html
for details on the format of Tika XHTML.  See also TikaExtractOnlyExampleOutput.
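
For the ext.* parameters as they are listed in this revision (the section is marked as work in progress, so the names may change), an extract-only request could be parameterized roughly as follows. This is a sketch that only encodes the query string; the helper name and the sample file name are hypothetical.

```python
from urllib.parse import urlencode


def extract_only_params(file_name):
    """Encode parameters asking for extracted content without indexing."""
    return urlencode({
        "ext.extract.only": "true",      # return the Tika XHTML, do not index
        "ext.resource.name": file_name,  # optional hint for mime type detection
    })


if __name__ == "__main__":
    print(extract_only_params("report.pdf"))
```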
  
  = Configuration =
+ 
+ If you are not working from the supplied example/solr directory, you must copy all libraries
from example/solr/libs into a libs directory within your own solr directory. The !ExtractingRequestHandler
is not incorporated into the solr war file; you have to install it separately.
+ 
  
  Example config:
  
