lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "ExtractingRequestHandler" by GrantIngersoll
Date Fri, 14 Nov 2008 20:03:39 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The following page has been changed by GrantIngersoll:
http://wiki.apache.org/solr/ExtractingRequestHandler

------------------------------------------------------------------------------
  
  A common need of users is the ability to ingest binary and/or structured documents such as Office, PDF and other proprietary formats.  The [http://www.lucene.apache.org/tika Apache Tika] project provides a framework for wrapping many different file format parsers, such as PDFBox, POI and others.
  
- The !ExtractingRequestHandler will provide a wrapper around Tika to allow users to upload binary files to Solr and have Solr extract text from them and then index it.
+ Solr's !ExtractingRequestHandler provides a wrapper around Tika to allow users to upload binary files to Solr and have Solr extract text from them and then index it.
  
  = Getting Started =
  
@@ -21, +21 @@

  
  In a separate window, post a file:
  
-  *  curl http://localhost:8983/solr/update/extract?ext.idx.attr=true\&ext.def.fl=text -F "myfile=@tutorial.html"
+  *  curl http://localhost:8983/solr/update/extract?ext.idx.attr=true\&ext.def.fl=text -F "myfile=@tutorial.html" //Note, the trunk/site contains some nice example docs.
  
  or
  
@@ -29, +29 @@

         <!> NOTE: this literally streams the file, so Solr receives no information about the file's name. The !ExtractingRequestHandler will therefore auto-generate an ID for the file unless you specify one by adding a literal value (see below).
  
  or whatever other way you know how to do it.
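As a rough illustration of the curl call above, the following Python sketch builds the same request URL from its parts. The base URL and the two parameters (`ext.idx.attr`, `ext.def.fl`) are taken from the example; `build_extract_url` is a hypothetical helper name used only here and is not part of Solr.

```python
from urllib.parse import urlencode


def build_extract_url(solr_base, params):
    """Return the /update/extract URL with params encoded as a query string.

    Hypothetical helper for illustration; it only assembles a string and
    does not contact Solr.
    """
    return solr_base.rstrip("/") + "/update/extract?" + urlencode(params)


# Same parameters as the curl example: index attributes, default field "text".
url = build_extract_url(
    "http://localhost:8983/solr",
    {"ext.idx.attr": "true", "ext.def.fl": "text"},
)
print(url)
# prints http://localhost:8983/solr/update/extract?ext.idx.attr=true&ext.def.fl=text
```

When a URL like this is passed to curl, wrapping it in double quotes avoids the need for the `\&` escape used in the command above.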
+ 
+ = Configuration =
+ 
+ Example config:
+ 
+ {{{
+ <requestHandler name="/update/extract" class="solr.ExtractingRequestHandler">
+     <lst name="defaults">
+       <str name="ext.map.Last-Modified">last_modified</str>
+       <bool name="ext.ignore.und.fl">true</bool>
+     </lst>
+     <!-- Specify a path to a Tika configuration file.  See the Tika docs for details. -->
+     <str name="tika.config">/my/path/to/tika.config</str>
+     <!-- Specify one or more date formats to parse.  See DateUtil.DEFAULT_DATE_FORMATS for default date formats -->
+     <lst name="date.formats">
+       <str>yyyy-MM-dd</str>
+     </lst>
+ </requestHandler>
+ }}}
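To make the shape of the example config easier to see, here is a minimal Python sketch that parses the same XML fragment and lists its three parts: the `defaults` list, the `tika.config` path, and the `date.formats` list. This is only an inspection aid built on Python's standard library; it is not how Solr itself reads the configuration, and `summarize_handler` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# The example handler config from above, with comments omitted.
CONFIG = """
<requestHandler name="/update/extract" class="solr.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="ext.map.Last-Modified">last_modified</str>
    <bool name="ext.ignore.und.fl">true</bool>
  </lst>
  <str name="tika.config">/my/path/to/tika.config</str>
  <lst name="date.formats">
    <str>yyyy-MM-dd</str>
  </lst>
</requestHandler>
"""


def summarize_handler(config_xml):
    """Collect the handler name, defaults, Tika config path and date formats.

    Hypothetical helper: values are returned as plain strings, exactly as
    written in the XML.
    """
    root = ET.fromstring(config_xml)
    defaults = {el.get("name"): el.text for el in root.find("lst[@name='defaults']")}
    date_formats = [el.text for el in root.find("lst[@name='date.formats']")]
    return {
        "handler": root.get("name"),
        "defaults": defaults,
        "tika_config": root.findtext("str[@name='tika.config']"),
        "date_formats": date_formats,
    }


summary = summarize_handler(CONFIG)
for key, value in summary.items():
    print(key, "=", value)
```

A check like this also confirms that the fragment is well-formed XML before it is pasted into a Solr configuration file.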
  
  = Input Parameters =
  
