lucene-solr-user mailing list archives

From Chris Hostetter
Subject Re: A good signature class for deduplication
Date Thu, 01 Sep 2011 17:40:42 GMT

: I want to deduplicate documents from search results. What should be the
: parameters on which I should decide an efficient SignatureClass? Also, what
: are the SignatureClasses available?

the signature classes available are the ones mentioned on the wiki...

...which one you should choose, and which fields you feed it, depend 
entirely on your goal -- if you want to deduplicate any time both the 
"user_fname" and "user_lname" fields are exactly the same, then use those 
fields with either the MD5Signature or the Lookup3Signature (lookup3 
is faster, but some people want MD5 because they want to use the computed 
MD5 for other things)
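
to make the exact-match case concrete, here is a minimal sketch in Python of the general idea: hash the exact values of the chosen fields and treat documents with equal hashes as duplicates. this is an illustration only, not Solr's implementation; the helper names `exact_signature` and `deduplicate` are hypothetical, and the length-prefixing of values is an assumption made here to keep field boundaries unambiguous.

```python
import hashlib


def exact_signature(doc, fields):
    """Return an MD5 hex digest over the exact values of the given fields.

    Hypothetical helper for illustration; not Solr's MD5Signature code.
    """
    h = hashlib.md5()
    for name in fields:
        encoded = doc.get(name, "").encode("utf-8")
        # Length-prefix each value so ("ab", "c") and ("a", "bc") hash differently.
        h.update(str(len(encoded)).encode("ascii") + b":" + encoded)
    return h.hexdigest()


def deduplicate(docs, fields):
    """Keep the first document seen for each distinct signature."""
    seen = set()
    unique = []
    for doc in docs:
        sig = exact_signature(doc, fields)
        if sig not in seen:
            seen.add(sig)
            unique.append(doc)
    return unique


docs = [
    {"id": "1", "user_fname": "Ada", "user_lname": "Lovelace"},
    {"id": "2", "user_fname": "Ada", "user_lname": "Lovelace"},
    {"id": "3", "user_fname": "Alan", "user_lname": "Turing"},
]
unique_docs = deduplicate(docs, ["user_fname", "user_lname"])
```

only the fields you pass in affect the signature, so documents "1" and "2" above collapse into one even though their ids differ.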

if you want to detect when some much longer "body" field containing a lot 
of full text is *nearly* identical, then you should consider the 
TextProfileSignature -- how exactly it works and how you tune it i 
don't know off the top of my head.
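
for a rough feel of how a "nearly identical" signature can work in general, here is a simplified Python sketch: build a profile of token frequencies, round the frequencies down to a coarse step, drop rare tokens, and hash what is left. this is an assumption-laden illustration of the general technique, not a description of what TextProfileSignature actually does; the function `fuzzy_signature` and its parameters `min_token_len` and `quant_rate` are hypothetical names chosen here.

```python
import hashlib
import re
from collections import Counter


def fuzzy_signature(text, min_token_len=2, quant_rate=0.01):
    """Hash a coarse token-frequency profile of the text.

    Hypothetical illustration; parameter names and defaults are assumptions.
    """
    tokens = [
        t for t in re.findall(r"[a-z0-9]+", text.lower())
        if len(t) >= min_token_len
    ]
    counts = Counter(tokens)
    if not counts:
        return hashlib.md5(b"").hexdigest()

    # Quantization step: a fraction of the highest frequency, at least 2,
    # so tokens that occur only once never reach the profile.
    quant = max(2, round(max(counts.values()) * quant_rate))

    profile = []
    for token, freq in counts.items():
        rounded = (freq // quant) * quant  # round down to a multiple of quant
        if rounded >= quant:
            profile.append((token, rounded))

    # Sort so the signature does not depend on token order in the text.
    profile.sort(key=lambda p: (-p[1], p[0]))
    payload = "\n".join(f"{token} {freq}" for token, freq in profile)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


sig_a = fuzzy_signature("solr search " * 10 + "typo")
sig_b = fuzzy_signature("solr search " * 10 + "tpyo")
sig_c = fuzzy_signature("lucene index " * 10)
```

in this sketch the two texts that differ only in a single rare word get the same signature, while the unrelated text does not; a larger step makes matching looser, a smaller one makes it stricter.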

