lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Omar Didi" <od...@Cyveillance.com>
Subject RE: Checking for duplicates inside index
Date Tue, 23 May 2006 01:47:07 GMT
you have two choices that I can think of:
1- before adding a document, check if it does't exist in the index. you can do this by querying
on a unique field if you have it .
2- you can index all your documents, and once the indexing is done you can dedupe. (Lucene
has built in methods that can help with this)


if your index doesn't have a unique key, you need to add one like the one you suggested.

-----Original Message-----
From: karl wettin [mailto:kalle@snigel.net]
Sent: Monday, May 22, 2006 6:05 PM
To: java-user@lucene.apache.org
Subject: Re: Checking for duplicates inside index


On Mon, 2006-05-22 at 23:42 +0200, Hannes Carl Meyer wrote:
> 
> I'm indexing ~10000 documents per day but since I'm getting a lot of 
> real duplicates (100% the same document content) I want to check the 
> content before indexing...
> 
> My idea is to create a checksum of the documents content and store it 
> within document inside the index, before indexing a new document I
> will compare the new documents checksum with the ones in the index.
> 
> Is that a good idea? does someone have experiences with that method?
> any tools available? 

That could work.

You will need a big sum though. MD5?



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message