lucene-solr-commits mailing list archives

From Apache Wiki <>
Subject [Solr Wiki] Update of "ExtractingRequestHandler" by JanHoydahl
Date Wed, 04 Jul 2012 21:57:17 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "ExtractingRequestHandler" page has been changed by JanHoydahl:

Info about password protected files

   * xpath=<XPath expression> - When extracting, only return Tika XHTML content that
satisfies the XPath expression.  See the Tika documentation for
details on the format of Tika XHTML.  See also TikaExtractOnlyExampleOutput.
   * lowernames=true|false - Map all field names to lowercase with underscores.  For example,
Content-Type would be mapped to content_type.
   * literalsOverride=true|false - <!> [[Solr4.0]] When true, literal field values will
override other values with the same field name, such as metadata and content. If false, literal
field values will be appended to any data extracted by Tika, and the resulting field needs
to be multi-valued. Default: true
+  * resource.password=<password> - <!> [[Solr4.0]] The optional password for
a password-protected PDF or OOXML file. File format support depends on Tika.
+  * passwordsFile=<file name> - <!> [[Solr4.0]] The optional name of a file containing
file name pattern to password mappings. See the section "Encrypted files" below.
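As an illustration of lowernames and literalsOverride, here is a minimal Python sketch. It is not Solr's implementation: the helper names lowername and merge_field are hypothetical, and it assumes that every character other than a letter or digit becomes an underscore, which is consistent with the Content-Type example above but is not otherwise confirmed here.

```python
import re


def lowername(field_name):
    """Lowercase a field name and replace other characters with underscores.

    Assumption: anything that is not a-z or 0-9 after lowercasing
    is replaced by "_".
    """
    return re.sub(r"[^a-z0-9]", "_", field_name.lower())


def merge_field(extracted_values, literal_values, literals_override=True):
    """Combine extracted and literal values for one field name.

    With literals_override=True, literal values replace the extracted ones.
    With literals_override=False, literal values are appended, so the
    resulting field has to be multi-valued.
    """
    if literals_override and literal_values:
        return list(literal_values)
    return list(extracted_values) + list(literal_values)
```

With literalsOverride=false, a field that receives both an extracted value and a literal value ends up with two values, which is why the target field has to be multi-valued.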
  If extractOnly is true, additional input parameters:
@@ -156, +158 @@

  It is highly recommended that you try the extractOnly option to see what values actually
get set for these.
+ = Encrypted files =
+ <!> [[Solr4.0]] By supplying a password either in {{{resource.password}}} on the request
or in a {{{passwordsFile}}} file, you can have ExtractingRequestHandler decrypt encrypted
files and index their content. A {{{passwordsFile}}} must contain one rule per line,
where each rule is a file name regular expression, followed by "=", followed by the
password in clear text. Because the passwords are stored in clear text, this file should have strict access restrictions.
+ {{{
+ # This is a comment
+ myFileName = myPassword
+ .*\.docx$ = myWordPassword
+ .*\.pdf$ = myPdfPassword
+ }}}
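To make the rule format concrete, the following Python sketch parses such a file and looks up the password for a file name. It is an illustration only, not the handler's actual code: the function names are hypothetical, and it assumes that whitespace around "=" is ignored, that a rule is split at the first "=", that a pattern must match the whole file name, and that the first matching rule wins.

```python
import re

SAMPLE_RULES = r"""
# This is a comment
myFileName = myPassword
.*\.docx$ = myWordPassword
.*\.pdf$ = myPdfPassword
"""


def parse_password_rules(text):
    """Return a list of (compiled pattern, password) pairs, in file order."""
    rules = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        pattern, separator, password = line.partition("=")
        if not separator:
            continue  # not a rule: no "=" on this line
        # Assumption: surrounding whitespace is not part of pattern or password
        rules.append((re.compile(pattern.strip()), password.strip()))
    return rules


def lookup_password(rules, file_name):
    """Return the password of the first rule matching file_name, else None."""
    for pattern, password in rules:
        if pattern.fullmatch(file_name):
            return password
    return None
```

For example, lookup_password(parse_password_rules(SAMPLE_RULES), "report.docx") returns "myWordPassword", while a file name that matches no rule yields None.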
  = Examples =
  == Mapping and Capture ==
  Capture <div> tags separately, and then map that field to a dynamic field named foo_txt.
@@ -191, +203 @@

  curl "http://localhost:8983/solr/update/extract?&extractOnly=true&wt=ruby&indent=true"
 --data-binary @tutorial.html  -H 'Content-type:text/html'
  See TikaExtractOnlyExampleOutput.
+ == Password protected ==
+ {{{
+ curl "http://localhost:8983/solr/collection1/update/extract?commit=true&resource.password=<password>" \
+      -H "Content-Type: application/pdf" --data-binary @my-encrypted-file.pdf
+ }}}
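A request that passes resource.password can also be built programmatically, which takes care of URL-encoding passwords that contain characters such as "&" or "@". The sketch below uses only the Python standard library. The function name build_extract_request is hypothetical, and the base URL is taken from the curl example as an assumption about the local setup.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_extract_request(base_url, data, password,
                          content_type="application/pdf"):
    """Build a POST request for /update/extract with a document password.

    base_url is the Solr core URL, e.g.
    "http://localhost:8983/solr/collection1" (assumed, as in the curl example).
    """
    # urlencode escapes the password so it is safe inside the query string
    query = urlencode({"commit": "true", "resource.password": password})
    return Request(
        f"{base_url}/update/extract?{query}",
        data=data,
        headers={"Content-Type": content_type},
        method="POST",
    )
```

To send it, read the encrypted file as bytes, pass them as data, and hand the resulting request to urllib.request.urlopen.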
  = Sending documents to Solr =
