lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Trivial Update of "ExtractingRequestHandler" by YonikSeeley
Date Thu, 15 Oct 2009 19:39:48 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "ExtractingRequestHandler" page has been changed by YonikSeeley:
http://wiki.apache.org/solr/ExtractingRequestHandler?action=diff&rev1=47&rev2=48

  Capture <div> tags separately, and then map that field to a dynamic field named foo_t.
  
  {{{
-  curl http://localhost:8983/solr/update/extract?captureAttr=true\&defaultField=text\&fmap.div=foo_t\&capture=div
 -F "tutorial=@tutorial.pdf"
+  curl http://localhost:8983/solr/update/extract?literal.id=doc2\&captureAttr=true\&defaultField=text\&fmap.div=foo_t\&capture=div
 -F "tutorial=@tutorial.pdf"
  }}}
  
  == Mapping, Capture and Boost ==
  Capture <div> tags separately, and then map that field to a dynamic field named foo_t.
 Boost foo_t by 3.
  {{{
- curl http://localhost:8983/solr/update/extract?captureAttr=true\&defaultField=text\&fmap.div=foo_t\&capture=div\&boost.foo_t=3\&literal.id=tutorial.pdf
 -F "tutorial=@tutorial.pdf"
+ curl http://localhost:8983/solr/update/extract?literal.id=doc3\&captureAttr=true\&defaultField=text\&capture=div\&fmap.div=foo_t\&boost.foo_t=3
 -F "tutorial=@tutorial.pdf"
  }}}
  
  == Literals ==
  
  To add in your own metadata, pass in the literal parameter along with the file:
  {{{
- curl http://localhost:8983/solr/update/extract?captureAttr=true\&defaultField=text\&fmap.div=foo_t\&capture=div\&boost.foo_t=3\&literal.id=id\&literal.blah_s=Bah
 -F "tutorial=@tutorial.pdf"
+ curl http://localhost:8983/solr/update/extract?literal.id=doc4\&captureAttr=true\&defaultField=text\&capture=div\&fmap.div=foo_t\&boost.foo_t=3\&literal.blah_s=Bah
 -F "tutorial=@tutorial.pdf"
  }}}
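  The literal parameters above can also be assembled by a small shell helper instead of hand-escaping each `&`. The following is only a sketch: the function name `extract_url` is hypothetical, the base URL is the one used in the examples above, and the helper assumes the values need no URL-encoding.

```shell
#!/bin/sh
# Hypothetical helper: join key=value pairs into an /update/extract URL.
# Assumes the values contain no characters that require URL-encoding.
extract_url() {
  base="http://localhost:8983/solr/update/extract"
  query=""
  for pair in "$@"; do
    if [ -z "$query" ]; then
      query="$pair"
    else
      query="$query&$pair"
    fi
  done
  printf '%s?%s\n' "$base" "$query"
}

url=$(extract_url literal.id=doc4 literal.blah_s=Bah defaultField=text)
echo "$url"
# To send the file, quote the URL so the shell leaves '&' alone:
#   curl "$url" -F "tutorial=@tutorial.pdf"
```

  Quoting the whole URL is what makes the `\&` escapes unnecessary; the request Solr receives should be the same either way.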
  
  == XPath ==
@@ -170, +170 @@

  Restrict down the XHTML returned by Tika by passing in an XPath expression
  
  {{{
- curl http://localhost:8983/solr/update/extract?captureAttr=true\&defaultField=text\&fmap.div=foo_t\&capture=div\&boost.foo_t=3\&literal.id=id\&\&xpath=\/xhtml:html\/xhtml:body\/xhtml:div\/descendant:node\(\)
 -F "tutorial=@tutorial.pdf"
+ curl http://localhost:8983/solr/update/extract?literal.id=doc5\&captureAttr=true\&defaultField=text\&capture=div\&fmap.div=foo_t\&boost.foo_t=3\&xpath=\/xhtml:html\/xhtml:body\/xhtml:div\/descendant:node\(\)
 -F "tutorial=@tutorial.pdf"
  }}}
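  The backslashes in the XPath example exist only to protect `/`, `(`, `)` and `&` from the shell. As a sketch of an alternative, the expression can be kept in a single-quoted variable and the URL double-quoted. The variable names `xpath` and `url` are illustrative, and the expression is copied verbatim from the example above.

```shell
#!/bin/sh
# Single quotes keep the XPath expression literal, so no backslashes are needed.
xpath='/xhtml:html/xhtml:body/xhtml:div/descendant:node()'

# Double quotes allow $xpath to expand while still protecting '&' from the shell.
url="http://localhost:8983/solr/update/extract?literal.id=doc5&xpath=$xpath"
echo "$url"
# The request itself would then be:
#   curl "$url" -F "tutorial=@tutorial.pdf"
```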
  
  == Extract Only ==
@@ -178, +178 @@

  curl http://localhost:8983/solr/update/extract?\&extractOnly=true  --data-binary @tutorial.html
 -H 'Content-type:text/html'
  }}}
  
+ As the output includes XML generated by Tika (and is hence further escaped by Solr's XML),
using a different output format enhances readability:
+ {{{
+ curl http://localhost:8983/solr/update/extract?\&extractOnly=true\&wt=ruby\&indent=true
 --data-binary @tutorial.html  -H 'Content-type:text/html'
+ }}}
+ 
  See TikaExtractOnlyExampleOutput.
  
  = Sending documents to Solr =
  
  // TODO: describe the different ways to send the documents to solr (POST body, form encoded,
remoteStreaming)
-  * curl http://localhost:8983/solr/update/extract?\&defaultField=text  --data-binary
@tutorial.html  -H 'Content-type:text/html'  
+  * curl http://localhost:8983/solr/update/extract?literal.id=doc5\&defaultField=text
 --data-binary @tutorial.html  -H 'Content-type:text/html'  
         <!> NOTE: this streams the raw file contents as the request body, so Solr receives
no information about the name of the file.
   * SolrJ:  Use the ContentStreamUpdateRequest (see SolrExampleTests.java for full example):{{{
      ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
