lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "ExtractingRequestHandler" by GrantIngersoll
Date Sat, 15 Nov 2008 16:37:30 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The following page has been changed by GrantIngersoll:
http://wiki.apache.org/solr/ExtractingRequestHandler

------------------------------------------------------------------------------
  
  = Getting Started =
  
+ == Prior to Patch Being Committed ==
   * Check out Solr trunk
   * Apply the patch: patch -p 0 -i <PATH TO SOLR-282.patch> [--dry-run]
   * Untar http://people.apache.org/~gsingers/extraction-libs.tar into the trunk/contrib/extraction/ directory (thus creating a lib directory there)
   * ant clean example  // build the example
   * cd example
   * java -jar start.jar
+ 
+ == After the Patch is Committed ==
+  * Check out Solr trunk, or get a 1.4 or later release if one exists
+  * Follow the steps above, starting with the "ant clean example" step.
  
  In a separate window, post a file:
  
@@ -128, +133 @@

  
  = Examples =
  
+ <!> NOTE: All the examples are run using curl on the command line, so the URLs contain extra escapes ("\").
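
As a minimal sketch of why the escapes are there: an unquoted "&" is interpreted by the shell, so each example below escapes it as "\&". Quoting the whole URL is an alternative that avoids the backslashes. The variable names (base, params, url) are hypothetical and used only for illustration; the parameters are taken from the Mapping and Capture example, and the curl call is left commented out because it assumes a running example server and a local tutorial.pdf.

```shell
#!/bin/sh
# Hypothetical variable names; the endpoint and parameters come from the examples on this page.
base='http://localhost:8983/solr/update/extract'
params='ext.idx.attr=true&ext.def.fl=text&ext.map.div=foo_t&ext.capture=div'

# Inside quotes the shell leaves "&" alone, so no "\" escapes are needed.
url="${base}?${params}"
echo "$url"

# Assumes Solr is running locally and tutorial.pdf is in the current directory:
# curl "$url" -F "tutorial=@tutorial.pdf"
```

The quoted and the escaped forms send the same request; the escaped form is simply what the examples on this page use.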
+ 
  == Mapping and Capture ==
  
  Capture <div> tags separately, and then map that field to a dynamic field named foo_t.
@@ -149, +156 @@

   curl http://localhost:8983/solr/update/extract?ext.idx.attr=true\&ext.def.fl=text\&ext.map.div=foo_t\&ext.capture=div\&ext.boost.foo_t=3\&ext.literal.blah_i=1 -F "tutorial=@tutorial.pdf"
  }}}
  
+ == XPath ==
+ 
+ Restrict the XHTML returned by Tika by passing in an XPath expression.
+ 
+ {{{
+  curl http://localhost:8983/solr/update/extract?ext.idx.attr=true\&ext.def.fl=text\&ext.map.div=foo_t\&ext.xpath=\/xhtml:html\/xhtml:body\/xhtml:div\/descendant:node\(\) -F "tutorial=@tutorial.pdf"
+ }}}
+ 
  == Extract Only ==
  {{{
  curl http://localhost:8983/solr/update/extract?ext.idx.attr=true\&ext.extract.only=true --data-binary @tutorial.html -H 'Content-type:text/html'
