lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "ExtractingRequestHandler" by GrantIngersoll
Date Wed, 16 Sep 2009 15:15:59 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The following page has been changed by GrantIngersoll:
http://wiki.apache.org/solr/ExtractingRequestHandler

------------------------------------------------------------------------------
  
  You may notice that although you can search on any of the text in the sample document, you
may not be able to see that text when the document is retrieved.  This is simply because the
"content" field generated by Tika is mapped to the Solr field called "text", which is indexed
but not stored. This is done via the default map rule in the {{{/update/extract}}} handler in
{{{solrconfig.xml}}} and can be easily changed or overridden. For example, to store and see
all metadata and content, execute the following:
  {{{
- curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&map.content=attr_content&commit=true' -F "myfile=@tutorial.html"
+ curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true' -F "myfile=@tutorial.html"
  }}}
   * The {{{uprefix=attr_}}} param causes all generated fields that aren't defined in the
schema to be prefixed with attr_ (which is a dynamic field that is stored).
-  * The {{{map.content=attr_content}}} param overrides the default {{{map.content=text}}}
causing the content to be added to the attr_content field instead.
+  * The {{{fmap.content=attr_content}}} param overrides the default {{{fmap.content=text}}}
causing the content to be added to the attr_content field instead.
  
   
  And then query via http://localhost:8983/solr/select?q=attr_content:tutorial
  
  = Input Parameters =
-  * map.<source_field>=<target_field> - Maps (moves) one field name to another.
 Example: {{{map.content=text}}} will cause the content field normally generated by Tika to
be moved to the "text" field.
+  * fmap.<source_field>=<target_field> - Maps (moves) one field name to another.
 Example: {{{fmap.content=text}}} will cause the content field normally generated by Tika
to be moved to the "text" field.
   * boost.<fieldname>=<float> -  Boost the specified field.
   * literal.<fieldname>=<value> - Create a field with the specified value. May
be multivalued if the Field is multivalued.
   * uprefix=<prefix> - Prefix all fields that are not defined in the schema with the
given prefix.  This is very useful when combined with dynamic field definitions.  Example:
{{{uprefix=ignored_}}} would effectively ignore all unknown fields generated by Tika given
the example schema contains {{{<dynamicField name="ignored_*" type="ignored"/>}}}
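
As a rough illustration of how the parameters listed above combine, the following shell sketch assembles a query string for the extract handler and prints the resulting URL. The particular values (`doc1`, `ignored_`, the boost of `2.0`) are assumptions chosen for illustration, and the `curl` call is left commented out because it needs a running Solr instance.

```shell
# Base URL of the extract handler, as used in the examples on this page.
base='http://localhost:8983/solr/update/extract'

# One parameter per line for readability (values are illustrative).
params='literal.id=doc1
uprefix=ignored_
fmap.content=text
boost.text=2.0'

# Join the lines with "&" to form the query string.
query=$(printf '%s' "$params" | tr '\n' '&')
url="$base?$query"

echo "$url"

# To send a document, pass the URL to curl (requires a running Solr):
# curl "$url" -F "myfile=@tutorial.html"
```

Quoting `"$url"` when calling curl avoids the backslash-escaped ampersands that an unquoted URL would need.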
@@ -75, +75 @@

  == Order of field operations ==
   1. fields are generated by Tika or passed in as literals via {{{literal.fieldname=value}}}
   1. if lowernames==true, fields are mapped to lower case
-  1. mapping rules {{{map.source=target}}} are applied
+  1. mapping rules {{{fmap.source=target}}} are applied
   1. unknown field names are prefixed with the value of {{{uprefix}}}
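
The order above can be sketched as a small shell function that follows a single field name through steps 2 to 4. This is only a model of the described order, not Solr's implementation: the variable names, the sample schema fields, and the assumption that `lowernames` does nothing beyond lowercasing are all illustrative.

```shell
# Illustrative settings (assumptions, not Solr defaults).
LOWERNAMES=true
FMAP_RULES='content=text div=foo_t'   # fmap.<source>=<target> pairs
UPREFIX='attr_'
SCHEMA_FIELDS='id text foo_t'         # fields the schema is assumed to define

# resolve_field NAME: print the name a generated or literal field
# would end up with after steps 2 to 4.
resolve_field() {
  name=$1

  # Step 2: map the name to lower case when lowernames is enabled.
  if [ "$LOWERNAMES" = true ]; then
    name=$(printf '%s' "$name" | tr '[:upper:]' '[:lower:]')
  fi

  # Step 3: apply the first fmap rule whose source matches.
  for rule in $FMAP_RULES; do
    if [ "$name" = "${rule%%=*}" ]; then
      name=${rule#*=}
      break
    fi
  done

  # Step 4: prefix names that the schema does not define.
  case " $SCHEMA_FIELDS " in
    *" $name "*) ;;
    *) name="$UPREFIX$name" ;;
  esac

  printf '%s\n' "$name"
}

resolve_field Content   # prints: text
resolve_field Author    # prints: attr_author
resolve_field id        # prints: id
```

Because mapping runs before prefixing, a field that is mapped onto a schema field (here `Content` to `text`) never receives the `uprefix` value.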
  
  = Configuration =
@@ -88, +88 @@

  {{{
  <requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
      <lst name="defaults">
-       <str name="ext.map.Last-Modified">last_modified</str>
+       <str name="fmap.Last-Modified">last_modified</str>
-       <bool name="ext.ignore.und.fl">true</bool>
      </lst>
      <!--Optional.  Specify a path to a tika configuration file.  See the Tika docs for
details.-->
      <str name="tika.config">/my/path/to/tika.config</str>
@@ -149, +148 @@

  Capture <div> tags separately, and then map that field to a dynamic field named foo_t.
  
  {{{
-  curl http://localhost:8983/solr/update/extract?ext.idx.attr=true\&ext.def.fl=text\&ext.map.div=foo_t\&ext.capture=div
 -F "tutorial=@tutorial.pdf"
+  curl http://localhost:8983/solr/update/extract?captureAttr=true\&defaultField=text\&fmap.div=foo_t\&capture=div
 -F "tutorial=@tutorial.pdf"
  }}}
  
  == Mapping, Capture and Boost ==
  Capture <div> tags separately, and then map that field to a dynamic field named foo_t.
 Boost foo_t by 3.
  {{{
- curl http://localhost:8983/solr/update/extract?captureAttr=true\&defaultField=text\&map.div=foo_t\&capture=div\&boost.foo_t=3\&literal.id=tutorial.pdf
 -F "tutorial=@tutorial.pdf"
+ curl http://localhost:8983/solr/update/extract?captureAttr=true\&defaultField=text\&fmap.div=foo_t\&capture=div\&boost.foo_t=3\&literal.id=tutorial.pdf
 -F "tutorial=@tutorial.pdf"
  }}}
  
  == Literals ==
  
  To add in your own metadata, pass in the literal parameter along with the file:
  {{{
- curl http://localhost:8983/solr/update/extract?captureAttr=true\&defaultField=text\&map.div=foo_t\&capture=div\&boost.foo_t=3\&literal.id=id\&literal.blah_s=Bah
 -F "tutorial=@tutorial.pdf"
+ curl http://localhost:8983/solr/update/extract?captureAttr=true\&defaultField=text\&fmap.div=foo_t\&capture=div\&boost.foo_t=3\&literal.id=id\&literal.blah_s=Bah
 -F "tutorial=@tutorial.pdf"
  }}}
  
  == XPath ==
@@ -170, +169 @@

  Restrict the XHTML returned by Tika by passing in an XPath expression.
  
  {{{
- curl http://localhost:8983/solr/update/extract?captureAttr=true\&defaultField=text\&map.div=foo_t\&capture=div\&boost.foo_t=3\&literal.id=id\&\&xpath=\/xhtml:html\/xhtml:body\/xhtml:div\/descendant:node\(\)
 -F "tutorial=@tutorial.pdf"
+ curl http://localhost:8983/solr/update/extract?captureAttr=true\&defaultField=text\&fmap.div=foo_t\&capture=div\&boost.foo_t=3\&literal.id=id\&\&xpath=\/xhtml:html\/xhtml:body\/xhtml:div\/descendant:node\(\)
 -F "tutorial=@tutorial.pdf"
  }}}
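
A possible way to make such a request easier to read is to keep the XPath expression in a single-quoted shell variable, so that the slashes, parentheses and ampersands need no backslash escapes. The sketch below only builds and prints the URL; the variable names are illustrative, and the `curl` call is commented out because it needs a running Solr instance.

```shell
base='http://localhost:8983/solr/update/extract'

# Single quotes keep the shell from interpreting "(" and ")".
xpath='/xhtml:html/xhtml:body/xhtml:div/descendant:node()'

url="$base?captureAttr=true&defaultField=text&fmap.div=foo_t&capture=div&boost.foo_t=3&literal.id=id&xpath=$xpath"

echo "$url"

# curl "$url" -F "tutorial=@tutorial.pdf"
```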
  
  == Extract Only ==
