lucene-java-user mailing list archives

From: Chris Lamprecht <clampre...@gmail.com>
Subject: Re: Ideas Needed - Finding Duplicate Documents
Date: Sun, 12 Jun 2005 18:46:08 GMT
Dave,

Can you define exactly what you consider "duplicate sentences"?  Is it
the same exact string, or the same words in the same order, or the
same words in any order, etc?

If you can normalize each sentence first, so that two "duplicate"
sentences are always the exact same string, then you should be able to
fit a hash of each sentence in RAM.  By hash, I mean a
cryptographic-strength hash, such as MD5.  The big difference between
an MD5 hash and a hashCode() hash is that you won't get collisions
with MD5.  (More accurately, the probability of two different
sentences having the same MD5 hash is astronomically low.)  Each MD5
takes 16 bytes, so 16 bytes * 25M sentences = 400MB of memory.  You
can probably get away with a smaller hash in your case, maybe half of
an MD5 (64 bits), saving you RAM.  You could also store the MD5 in
your Lucene index as a keyword field (after converting the 16-byte MD5
value into a 32-character hex string, for example), providing a fast
way to find duplicates at search time.
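
Here's a rough sketch of the indexing side (untested; it assumes
Lucene 1.4's Field.Keyword/Field.Text factories, that you pass in the
already-normalized sentence, and the field names "sentence" and "md5"
are just examples):

import java.io.UnsupportedEncodingException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class SentenceHasher {

    // MD5 the normalized sentence, return a 32-character hex string.
    public static String md5Hex(String normalized)
            throws NoSuchAlgorithmException, UnsupportedEncodingException {
        byte[] digest = MessageDigest.getInstance("MD5")
                                     .digest(normalized.getBytes("UTF-8"));
        StringBuffer hex = new StringBuffer(32);
        for (int i = 0; i < digest.length; i++) {
            int b = digest[i] & 0xff;
            if (b < 0x10) hex.append('0');  // left-pad single hex digits
            hex.append(Integer.toHexString(b));
        }
        return hex.toString();
    }

    // Build a document with the hash in an untokenized keyword field.
    public static Document toDocument(String sentence, String normalized)
            throws NoSuchAlgorithmException, UnsupportedEncodingException {
        Document doc = new Document();
        doc.add(Field.Text("sentence", sentence));          // the searchable text
        doc.add(Field.Keyword("md5", md5Hex(normalized)));  // exact-match dup key
        return doc;
    }
}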

If you can give more details on your requirements, people on this list
can probably come up with some pretty good solutions.
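
As for the BitSet you mentioned, something along these lines should
work at search time (again untested; it walks the "md5" terms with
Lucene's TermEnum/TermDocs and marks every doc after the first for
each hash):

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

public class DuplicateFinder {

    // Set a bit for every doc whose "md5" value was already seen;
    // the first doc carrying each hash stays clear.
    public static BitSet findDuplicates(IndexReader reader) throws IOException {
        BitSet duplicates = new BitSet(reader.maxDoc());
        TermEnum terms = reader.terms(new Term("md5", ""));
        try {
            do {
                Term t = terms.term();
                if (t == null || !"md5".equals(t.field())) break; // past the md5 terms
                if (terms.docFreq() > 1) {        // hash shared by 2+ docs
                    TermDocs docs = reader.termDocs(t);
                    try {
                        docs.next();              // keep the first occurrence
                        while (docs.next()) {
                            duplicates.set(docs.doc());
                        }
                    } finally {
                        docs.close();
                    }
                }
            } while (terms.next());
        } finally {
            terms.close();
        }
        return duplicates;
    }
}

You could then wrap that BitSet in a Filter to drop the duplicates
from your hits.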

-chris

On 6/12/05, Dave Kor <s0454888@sms.ed.ac.uk> wrote:
> Hi,
> 
> I would like to poll the community's opinion on good strategies for identifying
> duplicate documents in a Lucene index.
> 
> You see, I have an index containing roughly 25 million Lucene documents. My task
> requires me to work at the sentence level, so each Lucene document actually contains
> exactly one sentence. The issue I have right now is that sometimes, certain
> sentences are duplicated, and I'd like to be able to identify them as a BitSet
> so that I can filter these duplicates out of my search.
> 
> Obviously the brute-force method of pairwise compares would take forever. I have
> tried grouping sentences by their hashCode() values and then doing a pairwise
> compare between sentences that have the same hashCode, but even with a 1GB heap
> I ran out of memory after comparing 200k sentences.
> 
> Any other ideas?
> 
> 
> Regards
> Dave Kor.
