lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrzej Bialecki ...@getopt.org>
Subject Re: near duplicates
Date Tue, 24 Oct 2006 15:03:33 GMT
Beto Siless wrote:
> Hi Andrej!
>
> I'm taking a look to fuzzy signatures for near duplicate detection and 
> and I have seen your TextProfileSignature. The question is: If I index 
> the documents with their text signature, is there a way to filter near 
> duplicates at search time without comparing each document with all other?


Well, Nutch uses DeleteDuplicates tool for this, which basically sorts 
and compares these signatures and removes all duplicates from the index. 
Then, when searching, there are simply no duplicates anymore because 
they have been already deleted.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message