lucene-solr-user mailing list archives

From "alessandro.rieti@virgilio.it" <alessandro.ri...@virgilio.it>
Subject ExtractingRequestHandler configuration
Date Sun, 05 Dec 2010 15:42:01 GMT
Hi all,
I added the ExtractingRequestHandler to my Solr 1.4.1 instance, using the default configuration
that I found on the wiki (http://wiki.apache.org/solr/ExtractingRequestHandler):

<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
    <lst name="defaults">
      <str name="fmap.Last-Modified">last_modified</str>
      <str name="uprefix">ignored_</str>
    </lst>
    <!-- Optional. Specify a path to a Tika configuration file. See the Tika docs for details. -->
    <!-- <str name="tika.config">/my/path/to/tika.config</str> -->

    <!-- Optional. Specify one or more date formats to parse. See DateUtil.DEFAULT_DATE_FORMATS
         for default date formats. -->
    <!--
    <lst name="date.formats">
      <str>yyyy-MM-dd</str>
    </lst>
    -->
</requestHandler>

Now, when I ingest HTML and PDF documents via the SolrJ API, the documents I find in the Solr index
look like this:


stored/uncompressed,indexed,tokenized<Content-Type:application/pdf>
stored/uncompressed,indexed,omitNorms<PID:eims-document:25445#objects/eims-document:226946/datastreams/PDF/content>

stored/uncompressed,indexed,tokenized<content:  stream_size 1168557   Content-Type application/pdf
        >
stored/uncompressed,indexed,tokenized<stream_size:1168557>
stored/uncompressed,indexed,omitNorms<timestamp:2010-12-05T12:34:44.423>


How can I configure the handler to extract the text of the PDF/HTML content and add it to the content field?
Also, in order to update a document in the index, is it possible to ingest multiple binary objects
with the same PID?
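
For the first question, this is roughly the kind of configuration I have in mind, as a sketch only. It uses the same `fmap.<source>` pattern as the `fmap.Last-Modified` line above to map the extracted body text (which the handler exposes as `content`) onto a schema field. The target field name `content` is taken from the index dump above, and it is an assumption that the schema defines it as a stored, indexed text field. It is not confirmed that this changes the result.

```xml
<!-- Sketch only: same handler as above, with the body-text mapping made explicit. -->
<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- Assumption: the schema defines "content" as a stored, indexed text field. -->
    <str name="fmap.content">content</str>
    <str name="fmap.Last-Modified">last_modified</str>
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>
```

If the content field still holds only the stream metadata with a mapping like this in place, the cause may lie elsewhere, for example in how the file is sent or in which Tika parser libraries are available to Solr.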

Regards
Alessandro


 