lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Libbrecht <p...@activemath.org>
Subject Re: Ideas Needed - Finding Duplicate Documents
Date Mon, 13 Jun 2005 11:04:40 GMT
Have you tried comparing TermVectors ?
I would expect them, or an adjustment of them, to allow comparison to 
focus on "important terms" (e.g. about a 100-200 terms) and then allow 
a more reasonable computation.

paul


Le 12 juin 05, à 16:37, Dave Kor a écrit :

> Hi,
>
> I would like to poll the community's opinion on good strategies for 
> identifying
> duplicate documents in a lucene index.
>
> You see, I have an index containing roughly 25 million lucene 
> documents. My task
> requires me to work at sentence level so each lucene document actually 
> contains
> exactly one sentence. The issue I have right now is that sometimes, 
> certain
> sentences are duplicated and I'ld like to be able to identify them as 
> a BitSet
> so that I can filter away these duplicates in my search.
>
> Obviously the brute force method of pairwise compares would take 
> forever. I have
> tried grouping sentences using their hashCodes() and then do a 
> pairwise compare
> between sentences that has the same hashCode, but even with a 1GB heap 
> I ran
> out of memory after comparing 200k sentences.
>
> Any other ideas?
>
>
> Regards
> Dave Kor.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message