lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From karl wettin <karl.wet...@gmail.com>
Subject Re: near duplicates
Date Tue, 17 Oct 2006 16:36:56 GMT

17 okt 2006 kl. 17.54 skrev Find Me:

> How to eliminate near duplicates from the index?

I would probably try to measure the Ecludian distance between all  
documents, computed on terms and their positions. Or perhaps use  
standard deviation to find the distribution of terms in a document.  
One would based on the output from that try to find a threashold.  
Either way it will consume lots of CPU.



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message