lucene-java-user mailing list archives

From Hannes Carl Meyer <>
Subject Checking for duplicates inside index
Date Mon, 22 May 2006 21:42:48 GMT
Hi All,

I'm indexing ~10000 documents per day, but since I'm getting a lot of
real duplicates (100% identical document content), I want to check the
content before indexing.

My idea is to create a checksum of each document's content and store it
with the document in the index. Before indexing a new document, I would
compare its checksum with the ones already in the index.
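
A minimal sketch of the checksum idea, using only the Java standard library.
The class name DuplicateChecker, the method names, and the choice of SHA-256
are assumptions for illustration. The in-memory HashSet is a stand-in for the
checksum values that would be stored in the index, not actual Lucene code.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

public class DuplicateChecker {
    // Hypothetical stand-in for the checksum field stored in the index.
    private final Set<String> seenChecksums = new HashSet<>();

    // Returns the SHA-256 digest of the content as a lowercase hex string.
    public static String checksum(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // Records the checksum and returns true if the content is new;
    // returns false if identical content was already seen.
    public boolean addIfNew(String content) {
        return seenChecksums.add(checksum(content));
    }
}
```

With Lucene, the set would presumably be replaced by an untokenized field
holding the checksum, looked up by exact term before each document is added.
Note that a checksum only catches byte-identical content; documents that
differ in whitespace or encoding would need normalizing first.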

Is that a good idea? Does anyone have experience with that method? Are
there any tools available?

Thank you and kind regards

