lucene-solr-user mailing list archives

From "" <>
Subject ExtractingRequestHandler configuration
Date Sun, 05 Dec 2010 15:42:01 GMT
 Hi All,
I added the ExtractingRequestHandler to my Solr 1.4.1 instance with the default configuration
that I found on the wiki (

<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
    <lst name="defaults">
      <str name="fmap.Last-Modified">last_modified</str>
      <str name="uprefix">ignored_</str>
    <!--Optional.  Specify a path to a tika configuration file.  See the Tika docs for details.-->
    <!--<str name="tika.config">/my/path/to/tika.config</str>-->

    <!-- Optional. Specify one or more date formats to parse.  See DateUtil.DEFAULT_DATE_FORMATS
for default date formats -->
    <lst name="date.formats">

Now, when I ingest HTML and PDF documents via the SolrJ API, the documents I find in the Solr
index look like this:


stored/uncompressed,indexed,tokenized<content:  stream_size 1168557   Content-Type application/pdf

How can I configure the handler to extract the text of the PDF/HTML content and add it to the content field?
Also, in order to update a document in the index, is it possible to ingest multiple binary objects
with the same pid?
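
To make the first question concrete, here is a minimal sketch of the kind of mapping being asked about. It assumes that schema.xml defines a stored, indexed field named content; that field name is an assumption and is not confirmed anywhere in this message. The handler emits the extracted body text under the name content, and with uprefix set to ignored_ any field the schema does not define is stored under the ignored_ prefix instead.

```xml
<!-- Sketch only: the target field name "content" is an assumption and
     must match a field that is actually defined in schema.xml. -->
<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.Last-Modified">last_modified</str>
    <!-- Map the extracted body text (emitted as "content") to a schema field -->
    <str name="fmap.content">content</str>
    <!-- Fields not defined in the schema are prefixed, e.g. ignored_stream_size -->
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>
```

On the second question, a document sent with the same unique key value as an existing one normally replaces it rather than being merged into it, so this sketch does not by itself combine several binaries under one pid.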

