lucene-solr-user mailing list archives

From arnaud gaudinat <arnaud.gaudi...@gmail.com>
Subject Is deduplication possible during Tika extract?
Date Fri, 14 Jan 2011 13:15:53 GMT
Hello,

Here is an excerpt of my solrconfig.xml:

<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler" startup="lazy">
<lst name="defaults">

<str name="update.processor">dedupe</str>

<!-- All the main content goes into "text"... if you need to return
            the extracted text or do highlighting, use a stored field. -->
<str name="fmap.content">text</str>
<str name="lowernames">true</str>
<str name="uprefix">ignored_</str>

<!-- capture link hrefs but ignore div attributes -->
<str name="captureAttr">true</str>
<str name="fmap.a">links</str>
<str name="fmap.div">ignored_</str>
</lst>
</requestHandler>

and

<updateRequestProcessorChain name="dedupe">
<processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
<bool name="enabled">true</bool>
<str name="signatureField">signature</str>
<bool name="overwriteDupes">false</bool>
<str name="fields">text</str>
<str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
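
For context, the dedupe chain above writes its hash to the field named by signatureField, so schema.xml needs a matching field declaration. A minimal sketch, assuming a field type called "string" is defined in the schema (the schema is not shown in this message, so the type and attributes below are assumptions):

```xml
<!-- Hypothetical schema.xml entry; the name must match signatureField in the dedupe chain -->
<field name="signature" type="string" stored="true" indexed="true" multiValued="false" />
```

Indexing the field should make it possible to check whether a signature was computed at all for documents sent through a given handler, by querying on signature.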

Deduplication works when I post documents to "/update", but not when Solr
extracts them with Tika through "/update/extract".
Is deduplication possible during Tika extraction?

Thanks in advance,
Arno

