lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "ExtractingRequestHandler" by YonikSeeley
Date Sat, 12 Mar 2011 22:36:36 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "ExtractingRequestHandler" page has been changed by YonikSeeley.
The comment on this change is: switch examples to multivalued txt field.
http://wiki.apache.org/solr/ExtractingRequestHandler?action=diff&rev1=68&rev2=69

--------------------------------------------------

  
  = Examples =
  == Mapping and Capture ==
- Capture <div> tags separate, and then map that field to a dynamic field named foo_t.
+ Capture <div> tags separate, and then map that field to a dynamic field named foo_txt.
  
  {{{
-  curl "http://localhost:8983/solr/update/extract?literal.id=doc2&captureAttr=true&defaultField=text&fmap.div=foo_t&capture=div"
 -F "tutorial=@tutorial.pdf"
+  curl "http://localhost:8983/solr/update/extract?literal.id=doc2&captureAttr=true&defaultField=text&fmap.div=foo_txt&capture=div"
 -F "tutorial=@tutorial.pdf"
  }}}
  == Mapping, Capture and Boost ==
- Capture <div> tags separate, and then map that field to a dynamic field named foo_t.
 Boost foo_t by 3.
+ Capture <div> tags separate, and then map that field to a dynamic field named foo_txt.
 Boost foo_txt by 3.
  
  {{{
- curl "http://localhost:8983/solr/update/extract?literal.id=doc3&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3"
-F "tutorial=@tutorial.pdf"
+ curl "http://localhost:8983/solr/update/extract?literal.id=doc3&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_txt&boost.foo_txt=3"
-F "tutorial=@tutorial.pdf"
  }}}
  == Literals ==
  To add in your own metadata, pass in the literal parameter along with the file:
  
  {{{
- curl "http://localhost:8983/solr/update/extract?literal.id=doc4&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3&literal.blah_s=Bah"
 -F "tutorial=@tutorial.pdf"
+ curl "http://localhost:8983/solr/update/extract?literal.id=doc4&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_txt&boost.foo_txt=3&literal.blah_s=Bah"
 -F "tutorial=@tutorial.pdf"
  }}}
  == XPath ==
  Restrict down the XHTML returned by Tika by passing in an XPath expression
  
  {{{
- curl "http://localhost:8983/solr/update/extract?literal.id=doc5&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3&literal.id=id&xpath=/xhtml:html/xhtml:body/xhtml:div/descendant:node()"
 -F "tutorial=@tutorial.pdf"
+ curl "http://localhost:8983/solr/update/extract?literal.id=doc5&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_txt&boost.foo_txt=3&literal.id=id&xpath=/xhtml:html/xhtml:body/xhtml:div/descendant:node()"
 -F "tutorial=@tutorial.pdf"
  }}}
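
The XPath value above contains characters (`/`, `:`, `(`, `)`) that are safer to percent-encode when the URL is assembled by a program rather than typed into a shell. A minimal sketch of one way to do that, assuming the same local Solr URL as the curl examples; `build_extract_url` is a hypothetical helper for illustration, not part of Solr:

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed endpoint, copied from the curl examples in this message.
EXTRACT_URL = "http://localhost:8983/solr/update/extract"


def build_extract_url(doc_id, field_map=None, boosts=None, literals=None, **params):
    """Assemble an extract URL with percent-encoded query parameters.

    Hypothetical helper: it only mirrors the parameter names used in the
    examples (literal.*, fmap.*, boost.*, capture, xpath, ...).
    """
    query = [("literal.id", doc_id)]
    query += [(f"fmap.{src}", dst) for src, dst in (field_map or {}).items()]
    query += [(f"boost.{field}", str(value)) for field, value in (boosts or {}).items()]
    query += [(f"literal.{name}", value) for name, value in (literals or {}).items()]
    query += list(params.items())
    return f"{EXTRACT_URL}?{urlencode(query)}"


url = build_extract_url(
    "doc5",
    field_map={"div": "foo_txt"},
    boosts={"foo_txt": 3},
    captureAttr="true",
    defaultField="text",
    capture="div",
    xpath="/xhtml:html/xhtml:body/xhtml:div/descendant:node()",
)
print(url)

# Round-trip check: decoding the query string gives back the original XPath.
decoded = parse_qs(urlsplit(url).query)
print(decoded["xpath"][0])
```

The resulting URL can be passed to curl in place of the hand-written one, with the same `-F "tutorial=@tutorial.pdf"` argument.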
  == Extract Only ==
  {{{
  curl "http://localhost:8983/solr/update/extract?&extractOnly=true"  --data-binary @tutorial.html
 -H 'Content-type:text/html'
  }}}
- A the output includes XML generated by Tika (and is hence further escaped by Solr's XML)
using a different output format enhance the readability:
+ The output includes XML generated by Tika and is thus further escaped by Solr's XML format.
Using a different output format like json or ruby enhances the readability:
  
  {{{
  curl "http://localhost:8983/solr/update/extract?&extractOnly=true&wt=ruby&indent=true"
 --data-binary @tutorial.html  -H 'Content-type:text/html'
