lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "Deduplication" by Mark Miller
Date Tue, 18 Nov 2008 14:50:47 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The following page has been changed by Mark Miller:
http://wiki.apache.org/solr/Deduplication

------------------------------------------------------------------------------
  Implementations:
  
  || MD5Signature || Used for exact duplicate detection. ||
- || TextProfileSignature || Fuzzy hashing implementation from nutch for near duplicate detection. ||
+ || TextProfileSignature || Fuzzy hashing implementation from nutch for near duplicate detection. It's tunable but works best on longer text. ||
  
  There are other more sophisticated algorithms for fuzzy/near hashing that could be added later.
  
@@ -45, +45 @@

  
  == solrconfig.xml ==
  
- The DeduplicateUpdateProcessorFactory has to be registered in the solrconfig.xml as part of the UpdateRequest Chain:
+ The SignatureUpdateProcessorFactory has to be registered in the solrconfig.xml as part of the UpdateRequest Chain:
  
  Accepting all defaults:
  {{{
    <updateRequestProcessorChain name="dedupe">
      <processor
-       class="org.apache.solr.update.processor.DeduplicateUpdateProcessorFactory">
+       class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
  
      </processor>
      <processor class="solr.RunUpdateProcessorFactory" />
@@ -62, +62 @@

  {{{
    <updateRequestProcessorChain name="dedupe">
      <processor
-       class="org.apache.solr.update.processor.DeduplicateUpdateProcessorFactory">
+       class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
  
          <bool name="enabled">true</bool>
          <str name="fields">field1,field2</str>
