lucene-java-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Detection of index dublicates in Lucene
Date Mon, 30 Jul 2007 12:43:00 GMT
I believe Nutch has a duplicate detection algorithm.  I don't know  
how easy it would be to run independently on a Lucene index.
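
For exact (lexically equivalent) duplicates, one simple approach that does not
depend on Nutch is to fingerprint each document's normalized text and check the
fingerprint before indexing. What follows is only a minimal sketch in plain
Java, not Nutch's algorithm and not a Lucene API: the class name
DuplicateDetector, its methods, and the lowercase/whitespace normalization are
all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not part of Lucene or Nutch): remembers a fingerprint
// for every document seen so far and reports exact repeats.
public class DuplicateDetector {
    private final Set<String> seen = new HashSet<>();

    // Assumed normalization: trim, lowercase, collapse runs of whitespace.
    static String normalize(String text) {
        return text.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    // SHA-256 digest of the normalized text, rendered as a hex string.
    static String fingerprint(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(normalize(text).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    // Returns true if a document with the same normalized text was seen
    // before; otherwise records the fingerprint and returns false.
    public boolean isDuplicate(String text) {
        return !seen.add(fingerprint(text));
    }
}
```

Calling isDuplicate(text) before adding a document would let the indexing step
be skipped for repeats. This only catches documents that are identical after
normalization; near-duplicates would need a fuzzier signature.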

-Grant

On Jul 29, 2007, at 2:18 AM, Dmitry wrote:

> We are trying to find out whether there is any implementation for  
> Lucene that detects duplicates in an index.
> Assume we have a set of documents, where a document is a bunch of  
> words. After we have created index entries for a document, we need  
> to know that all index entries are unique for that specific  
> document (lexical equivalence).
>
> Is there an implementation that accounts for both the case where  
> the algorithm fails to detect a duplicate and the case where it  
> reports a false duplicate? With that we could find all duplicate  
> index entries.
>
> We could use the same algorithm to detect duplicate documents. In  
> that case we would save time and get better performance by not  
> running the indexing services against such a document.
>
> Any suggestions would be welcome.
>
> Thanks,
>
> DT,
>
> www.ejinz.com
>
> Search Engine News
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


