lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Eliminate duplicates
Date Sun, 18 Mar 2007 22:28:09 GMT
But I think you have a problem here with searching the lucene
index and deleting duplicate titles. Say you have the
following titles:
title one
title one is a nice file
title one is a really nice file

Further assume you're about to add a duplicate "title one"

Searching on "title one" will give you three hits and you want
one, assuming you're using an analyzer that breaks on
whitespace. Are you indexing something (perhaps not
shown in your snippet) that uniquely identifies each title?

What you may have to do is index another field UN_TOKENIZED
that contains the entire title, or some clever representation of
the title that's more compact in order to handle this issue.

BTW, instead of searching with a query, it might be faster
to use TermEnum on your unique field. If TermEnum finds
a term like the one you're about to add, you already have
the title and can either delete the one already in the index
(see TermDocs) or just not add the current one....

I've had cases where we update documents already in the index,
so I needed a FIFO clean-up step. What I did was just index
*everything*, without trying to determine whether a document
was already in the index. Then, as a post-index step,
use TermDocs/TermEnum to make a single pass through
the index deleting all but the last instance of a document.
If you elect to do this, the trick is to construct your
term like new Term("field", "") to enumerate ALL the terms...

Best
Erick

On 3/18/07, Otis Gospodnetic <otis_gospodnetic@yahoo.com> wrote:
>
> Markus,
> What you were thinking is fine - search and, if found, delete first, then
> add.  Lucene allows duplicate and offers no automated way for avoiding them.
>
> Otis
> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
> Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
>
> ----- Original Message ----
> From: Markus Franz <mail@markus-franz.de>
> To: java-user@lucene.apache.org
> Sent: Sunday, March 18, 2007 3:10:36 PM
> Subject: Eliminate duplicates
>
> Hello!
>
> An excerpt from my Lucene use:
>
> Document document = new Document();
> document.add(new Field("title", item.title, Field.Store.YES, =
> Field.Index.TOKENIZED));
> document.add(new Field("desc", item.desc, Field.Store.YES, =
> Field.Index.TOKENIZED));
> writer.addDocument(document);
> writer.optimize();
>
> My problem is: I can add the same entry twice times. How can I avoid =
> this by checking the title? (Of course I thought of doing a search query =
> for the title before inserting, but isn't there a more cool way?)
>
> Markus
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message