lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dave Kor <>
Subject Ideas Needed - Finding Duplicate Documents
Date Sun, 12 Jun 2005 14:37:50 GMT

I would like to poll the community's opinion on good strategies for identifying
duplicate documents in a lucene index.

You see, I have an index containing roughly 25 million lucene documents. My task
requires me to work at sentence level so each lucene document actually contains
exactly one sentence. The issue I have right now is that sometimes, certain
sentences are duplicated and I'ld like to be able to identify them as a BitSet
so that I can filter away these duplicates in my search.

Obviously the brute force method of pairwise compares would take forever. I have
tried grouping sentences using their hashCodes() and then do a pairwise compare
between sentences that has the same hashCode, but even with a 1GB heap I ran
out of memory after comparing 200k sentences.

Any other ideas?

Dave Kor.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message