lucene-java-user mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: near duplicates
Date Tue, 17 Oct 2006 16:55:43 GMT
karl wettin wrote:
>
> 17 okt 2006 kl. 17.54 skrev Find Me:
>
>> How to eliminate near duplicates from the index?
>
> I would probably try to measure the Euclidean distance between all 
> documents, computed on terms and their positions. Or perhaps use 
> standard deviation to find the distribution of terms in a document. 
> Based on the output from that, one would try to find a threshold. 
> Either way it will consume lots of CPU.


There are better ways to achieve this. You need to create a fuzzy 
signature of the document, based on a term histogram or on shingles. 
Take a look at the Signature framework in Nutch.
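
To make the term-histogram idea concrete, here is a minimal sketch. It is 
not the Nutch API: the class name FuzzySignature, the method of(), and the 
constants MIN_TOKEN_LENGTH and QUANT_RATE are all made up for illustration, 
and the values are arbitrary. The idea is to count terms, round the counts 
down to a coarse quantum so that small edits disappear, and hash what is 
left. Documents with equal signatures are treated as near duplicates.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not part of Lucene or Nutch.
final class FuzzySignature {

    // Assumed tuning knobs; both values are illustrative, not recommendations.
    static final int MIN_TOKEN_LENGTH = 3;
    static final double QUANT_RATE = 0.2;

    /** Returns a hex MD5 digest of the quantized term histogram of the text. */
    static String of(String text) {
        // 1. Build the term histogram over lowercased letter/digit tokens.
        Map<String, Integer> histogram = new HashMap<>();
        int maxFreq = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.length() < MIN_TOKEN_LENGTH) {
                continue; // skip empty and very short tokens
            }
            int freq = histogram.merge(token, 1, Integer::sum);
            maxFreq = Math.max(maxFreq, freq);
        }

        // 2. Quantize: round each count down to a multiple of the quantum
        //    and drop terms whose count rounds down to zero.
        int quantum = Math.max(1, (int) Math.round(maxFreq * QUANT_RATE));
        List<String> profile = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : histogram.entrySet()) {
            int quantized = (entry.getValue() / quantum) * quantum;
            if (quantized > 0) {
                profile.add(entry.getKey() + ":" + quantized);
            }
        }

        // 3. Sort so the signature does not depend on hash map iteration order.
        Collections.sort(profile);

        // 4. Hash the profile into a fixed-size signature.
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(String.join(" ", profile).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

You would store FuzzySignature.of(text) with each document and skip any 
document whose signature is already in the index. Note that with these 
settings a short document gets a quantum of 1, so every term counts and 
the signature behaves like an exact hash; the fuzziness only shows up on 
longer texts. A count that sits right at a quantization boundary can also 
flip the signature, which is one reason shingle-based similarity is often 
used when a graded score is needed.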

There is a substantial literature on this subject - go to Citeseer and 
run a search for "near duplicate detection".

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




