lucene-java-user mailing list archives

From karl wettin <karl.wet...@gmail.com>
Subject Re: Detection of index dublicates in Lucene
Date Mon, 30 Jul 2007 12:46:49 GMT
On 30 Jul 2007, at 14:43, Grant Ingersoll wrote:

> I believe Nutch has a duplicate detection algorithm.  I don't know  
> how easy it would be to run independently on a Lucene index.

A number of near-duplicate detection ideas have also been presented
on the forums before.

This is one of the threads:
<http://www.nabble.com/Checking-for-duplicates-inside-index-tf1665494.html>
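
For the exact (lexical) duplicate case described in the original question, a minimal sketch could look like the following. It is an illustration only: it is not Nutch's algorithm and is not taken from the thread linked above. The class and method names (DuplicateDetector, normalize, fingerprint, addIfNew) are hypothetical, and the normalization rule (lowercase, collapse everything that is not a letter or digit) is an assumption about what "lexically equivalent" should mean. It does not catch near-duplicates.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: flags documents whose normalized text was already seen. */
public class DuplicateDetector {
    // Fingerprints of every document accepted so far.
    private final Set<String> seen = new HashSet<>();

    /** Lowercases the text and collapses each run of non-letter/non-digit characters into one space. */
    static String normalize(String text) {
        return text.toLowerCase(Locale.ROOT)
                   .replaceAll("[^\\p{L}\\p{Nd}]+", " ")
                   .trim();
    }

    /** Returns the SHA-256 digest of the normalized text as a 64-character hex string. */
    static String fingerprint(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(normalize(text).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is not expected.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    /** Returns true if the text is new (and should be indexed), false if it is a duplicate. */
    public boolean addIfNew(String text) {
        return seen.add(fingerprint(text));
    }
}
```

With an existing Lucene index, the same fingerprint could presumably be stored in an untokenized field and looked up before a document is added, instead of being kept in an in-memory set.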


-- 
karl


>
> -Grant
>
> On Jul 29, 2007, at 2:18 AM, Dmitry wrote:
>
>> We are trying to find out whether there is any implementation for
>> Lucene that detects duplicates in an index.
>> Assume we have a set of documents, where a document is a bunch of
>> words. After we have created indexes for the same document, we need to
>> know that all indices will be unique for a specific document (lexical
>> equivalency).
>>
>> Can such an implementation account for both the situation where the
>> algorithm has not detected a duplicate and the situation where the
>> algorithm has reported a false duplicate? In that case we could find
>> all duplicate indices.
>>
>> We could use the same algorithm to detect duplicate documents; in
>> that case we would save time and get better performance by not
>> running indexing services against such a document.
>>
>> Any suggestions would be appreciated.
>>
>> Thanks,
>>
>> DT,
>>
>> www.ejinz.com
>>
>> Search Engine News
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>

