lucene-java-user mailing list archives

From "Karsten Konrad" <karsten.kon...@dacos.com>
Subject Re: Ideas Needed - Finding Duplicate Documents
Date Sun, 12 Jun 2005 17:16:43 GMT
 
Hi David,

>>
I would like to poll the community's opinion on good strategies for identifying
duplicate documents in a lucene index.
>>

Do you mean 100% duplicates or some kind of similarity?

>>
Obviously the brute force method of pairwise compares would take forever. I have tried
grouping sentences by their hashCode() and then doing a pairwise compare between
sentences that have the same hashCode, but even with a 1GB heap I ran out of memory
after comparing 200k sentences.
>>

If you are only after 100% duplicates, you are on the right track with
hash code.
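
A small illustration of why the pairwise compare is still needed after grouping by hash code: two different strings can share the same hashCode(). The class and method names below are made up for this sketch; only String.hashCode() and String.equals() are real API.

```java
public class HashCollisionDemo {

    /** True when two strings share a hash code but differ in content. */
    static boolean isFalsePositive(String a, String b) {
        return a.hashCode() == b.hashCode() && !a.equals(b);
    }

    public static void main(String[] args) {
        // "Aa" and "BB" are a well-known colliding pair for String.hashCode().
        System.out.println("Aa".hashCode() + " " + "BB".hashCode()); // prints "2112 2112"
        System.out.println(isFalsePositive("Aa", "BB"));             // prints "true"
    }
}
```

So an equal hash code only marks candidates; equals() decides whether they are real duplicates.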

You could encode the hash code of each string into the index by adding it as a
separate field - your analyzer must index numbers for this! Then iterate over all
terms of that field, retrieving the document enumerator for each term; wherever you
find more than one document, do the pairwise comparison as usual. This way, you should
never need to compare more than a few documents at a time.

All the best,

Karsten

--

Dr.-Ing. Karsten Konrad

Research & Development
DACOS Software GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
http://www.dacos.com

Tel: ++49/ (0) 681 - 302 64834
Fax: ++49/ (0) 681 - 302 64827



-----Original Message-----
From: Dave Kor [mailto:s0454888@sms.ed.ac.uk]
Sent: Sunday, 12 June 2005 16:38
To: java-user@lucene.apache.org
Subject: Ideas Needed - Finding Duplicate Documents

Hi,

I would like to poll the community's opinion on good strategies for identifying
duplicate documents in a lucene index.

You see, I have an index containing roughly 25 million lucene documents. My task
requires me to work at sentence level so each lucene document actually contains exactly
one sentence. The issue I have right now is that certain sentences are sometimes
duplicated, and I'd like to be able to identify them as a BitSet so that I can filter
these duplicates out of my search results.

Obviously the brute force method of pairwise compares would take forever. I have tried
grouping sentences by their hashCode() and then doing a pairwise compare between
sentences that have the same hashCode, but even with a 1GB heap I ran out of memory
after comparing 200k sentences.

Any other ideas?


Regards
Dave Kor.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

