lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: can IndexWriter.addIndexes de-dupe documents?
Date Mon, 22 Feb 2010 23:17:55 GMT
What sorts of rules would govern which one should be
kept? Say you were adding three indexes and there
was a document in each that was identical. Which one
should be kept?

I suspect any rule would be wrong at least part of
the time....

FWIW
Erick

On Mon, Feb 22, 2010 at 5:02 PM, Michael McCandless <lucene@mikemccandless.com> wrote:

> addIndexes doesn't make this possible.
>
> Maybe add the indexes but then make a 2nd pass to dedup?
>
> Mike
>
> On Mon, Feb 22, 2010 at 4:26 PM, jchang <jchangkihatest@gmail.com> wrote:
> >
> > When I call IndexWriter.addIndexes, is there anything I can do to make it
> > filter out duplicates based on a certain field (or group of fields)? If I
> > know that the id field of the document is unique, can I make addIndexes
> > know that if it finds a new document with the same id, the new one is
> > valid and the old one should be overwritten (or deleted and the new one
> > added in its place)?
> >
> > I don't see anything like a unique constraint in the Field class; I know
> > Lucene is not a SQL database, but I just wanted to check to make sure I'm
> > not missing anything.
> >
> >
> > --
> > View this message in context:
> > http://old.nabble.com/can-IndexWriter.addIndexes-de-dupe-documents--tp27694763p27694763.html
> > Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>
>
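
The second pass Mike suggests needs a rule for which duplicate survives, which is exactly the question Erick raises. Below is a minimal sketch, assuming the rule is "the most recently added document with a given id wins". It models the merged index as a plain list of id values in add order and uses no Lucene API; DedupSketch and positionsToDelete are hypothetical names chosen for illustration. In a real index, the returned positions would stand for the documents to delete once addIndexes has finished.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DedupSketch {

    /**
     * Returns the positions of the documents to delete so that only the
     * LAST document carrying each id survives (assumed "newest wins" rule).
     * ids.get(i) is the id field value of the i-th document, in add order.
     */
    public static List<Integer> positionsToDelete(List<String> ids) {
        // Maps each id to the position of the latest document seen with it.
        Map<String, Integer> lastSeen = new HashMap<>();
        List<Integer> toDelete = new ArrayList<>();
        for (int i = 0; i < ids.size(); i++) {
            // put() returns the previous position for this id, if any;
            // that older document is the one superseded by position i.
            Integer previous = lastSeen.put(ids.get(i), i);
            if (previous != null) {
                toDelete.add(previous);
            }
        }
        return toDelete;
    }

    public static void main(String[] args) {
        List<String> merged =
            Arrays.asList("doc-1", "doc-2", "doc-1", "doc-3", "doc-2");
        System.out.println(positionsToDelete(merged)); // prints [0, 1]
    }
}
```

A different rule (oldest wins, or a comparison on a timestamp field) would change only the body of the loop. Separately, when documents are added one at a time rather than merged through addIndexes, IndexWriter.updateDocument(Term, Document) deletes any documents matching the term before adding the new one, which gives the overwrite-by-id behaviour asked about.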
