lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jimmy the Geek <jt...@adelphia.net>
Subject RE: Checking for duplicates inside index
Date Tue, 23 May 2006 13:56:19 GMT
Any chance I could get my hands on code to "de-dup". I have a current
method I think is quite sub-optimal, as I am searching the index for a
dup on every insert.... Not a very good method I think...

Or any other suggestions on good ways to prevent duplicates? I am
indexing with a field that has a unique ID, so it should be fairly
straightforward...

Thanks.
On Mon, 2006-05-22 at 19:01 -0700, Eugene Tuan wrote:
> I have created a method that can delete duplicate docs. Basically,
> during indexing, a doc is associated with an id (a term field defined by
> you.) that is indexed. Then, call the method to delete duplicates
> whenever you update index. 
> 
> I haven't contributed back to Lucene community yet because our code is
> in beta testing now. 
> 
> My former colleague, Chris, has received agreement from Doug Cutting
> since last August that this feature is nice to have.
> 
> Eugene
> 
> 
> -----Original Message-----
> From: Omar Didi [mailto:odidi@Cyveillance.com] 
> Sent: Monday, May 22, 2006 6:47 PM
> To: java-user@lucene.apache.org
> Subject: RE: Checking for duplicates inside index
> 
> you have two choices that I can think of:
> 1- before adding a document, check if it does't exist in the index. you
> can do this by querying on a unique field if you have it .
> 2- you can index all your documents, and once the indexing is done you
> can dedupe. (Lucene has built in methods that can help with this)
> 
> 
> if your index doesn't have a unique key, you need to add one like the
> one you suggested.
> 
> -----Original Message-----
> From: karl wettin [mailto:kalle@snigel.net]
> Sent: Monday, May 22, 2006 6:05 PM
> To: java-user@lucene.apache.org
> Subject: Re: Checking for duplicates inside index
> 
> 
> On Mon, 2006-05-22 at 23:42 +0200, Hannes Carl Meyer wrote:
> > 
> > I'm indexing ~10000 documents per day but since I'm getting a lot of 
> > real duplicates (100% the same document content) I want to check the 
> > content before indexing...
> > 
> > My idea is to create a checksum of the documents content and store it 
> > within document inside the index, before indexing a new document I
> > will compare the new documents checksum with the ones in the index.
> > 
> > Is that a good idea? does someone have experiences with that method?
> > any tools available? 
> 
> That could work.
> 
> You will need a big sum though. MD5?
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message