lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Daniel Noll <>
Subject Re: Duplicates recods in index
Date Fri, 10 Feb 2006 02:13:37 GMT
Pasha Bizhan wrote:
> Hi, 
>> From: Daniel Noll [] 
>> I don't know how this will be for efficiency.  If you did it 
>> that way, you would have to re-open the index for every 
>> single document you add, otherwise you might miss a duplicate 
>> which was added recently.
> You do not need to reopen index for every single document if the new 
> data doesn't contain dupes.

True, I didn't think about that.  Our new data always contains dupes as 
the data comes in enormous quantities.  And because it comes in enormous 
quantities, we can't just resolve the duplicates of all the new data on 
the fly, so we store off all the hashes for later.

Another way to do it would be to open a reader on the text index but 
also keep a local copy of the most recently added hashes.  Every 1000 or 
however many documents you could reopen the index and clear your local 
hashset (I've been thinking about this a little lately as hitting the 
text index to find a match is quicker than hitting the database, but in 
all honesty creating the hash takes a lot longer for us.)


Daniel Noll

Nuix Australia Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia
Phone: (02) 9280 0699
Fax:   (02) 9212 6902

This message is intended only for the named recipient. If you are not
the intended recipient you are notified that disclosing, copying,
distributing or taking any action in reliance on the contents of this
message or attachment is strictly prohibited.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message