lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael Stoppelman" <>
Subject Re: Detection of index dublicates in Lucene
Date Mon, 30 Jul 2007 07:12:56 GMT
A couple of thoughts here...

You could hash (e.g.md5) all the documents in your index and eliminate
duplicates that way. Just pick one of the docs in the hash bucket as
the non-dup document and the delete the other dups. This could be run as a
batch job to eliminate the duplicates in an off-line process.

Alternatively, if you don't want to run over the entire index all the time,
you could add a field that contains some kind of hash (e.g. md5) as a field
lucene Document. At query time you could make sure all the Documents
returned are unique.


On 7/28/07, Dmitry <> wrote:
> We trying to find are any implementation for Lucene  -  detection index
> duclicates.
> Assuming we have a set of documents and a document is a bunch of words.
> After we created indexec for the same document we need to knwo that all
> ideces will be uniq for specific document. (lexical equivalency).
> Can we have like implementation of algorithm  has not determined a
> duplicate
> and another situation when algorithm has offered a false duplicate. In
> this
> case we can find all dublicate indeces.
> And the same Algorithm we can use to detect Document dublicates - in this
> case we save time and can get better performance not to run indexed
> services
> against this document.
> Please any suggestions will be good.
> Thanks,
> DT,
> Search Engine News
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message