lucene-java-user mailing list archives

From karl wettin
Subject Re: Checking for duplicates inside index
Date Mon, 22 May 2006 22:04:35 GMT
On Mon, 2006-05-22 at 23:42 +0200, Hannes Carl Meyer wrote:
> I'm indexing ~10000 documents per day, but since I'm getting a lot of
> real duplicates (100% the same document content), I want to check the
> content before indexing...
> My idea is to create a checksum of each document's content and store it
> with the document inside the index. Before indexing a new document, I
> will compare the new document's checksum with the ones in the index.
> Is that a good idea? Does anyone have experience with that method?
> Are there any tools available?

That could work.

You will need a large checksum, though, so that two different documents
are unlikely to end up with the same value. MD5?
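
A rough sketch of the checksum idea, using only the JDK's MessageDigest.
The class name ChecksumDedup and its methods are made up for illustration,
and the in-memory HashSet is only a stand-in for the index: in a real setup
the hex string would presumably be stored as an untokenized field on each
document and looked up by exact term before adding a new one. UTF-8 is an
assumed encoding for the content.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
final class ChecksumDedup {
    // Stand-in for the index: checksums of all content seen so far.
    private final Set<String> seen = new HashSet<>();

    // Returns the MD5 digest of the content as 32 lowercase hex characters.
    static String md5Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to treat the byte as unsigned before formatting.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Returns true if the content is new (and records its checksum),
    // false if content with the same checksum was already added.
    boolean addIfNew(String content) {
        return seen.add(md5Hex(content));
    }
}
```

Calling addIfNew(text) before indexing and skipping the document when it
returns false would drop exact duplicates. Two documents that differ by
even one byte get different checksums, so this only catches 100% identical
content.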
