lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: How to avoid duplicate records in lucene
Date Tue, 22 Jul 2008 18:22:13 GMT
NP, if my original reply had included my second one, then you'd have
known what I was talking about <G>...

I *love* it when I unknowingly demonstrate the issue I'm trying to clarify
<G>.

Best
Erick

On Tue, Jul 22, 2008 at 2:09 PM, mark harwood <markharw00d@yahoo.co.uk>
wrote:

> >>Well, the point of my question was to insure that we were all using
> common terms.
>
> Sorry, Erick. I thought your "define duplicate" question was asking me
> about DuplicateFilter's concept of duplicates rather than asking the
> original poster about his notion of what a duplicate document meant to him.
> You're right it would be useful to understand more about the intention of
> the original message.
>
> Cheers
> Mark
>
>
>
>
>
> ----- Original Message ----
> From: Erick Erickson <erickerickson@gmail.com>
> To: java-user@lucene.apache.org
> Sent: Tuesday, 22 July, 2008 2:37:50 PM
> Subject: Re: How to avoid duplicate records in lucene
>
> Well, the point of my question was to insure that we were all using common
> terms. For all we know, the original questioner considered "duplicate"
> records ones that had identical, or even similar text. Nothing in the
> original question indicated any de-dup happening.
>
> I've often found that assumptions that we are all talking about the same
> thing are...er...incorrect. And I don't want to waste my time answering
> questions that weren't what was asked......
>
> Best
> Erick
>
> On Mon, Jul 21, 2008 at 2:44 PM, markharw00d <markharw00d@yahoo.co.uk>
> wrote:
>
> > >>could you define duplicate?
> >
> > That's your choice of field that you want to de-dup on.
> > That could be a field such as "DatabasePrimaryKey" or perhaps a field
> > containing an MD5 hash of document content.
> > The DuplicateFilter ensures only one document can exist in results for
> each
> > unique value for the choice of field.
> >
> > Cheers
> > Mark
> >
> > Erick Erickson wrote:
> >
> >> could you define duplicate? As far as I know, you don't
> >> get the same (internal) doc id back more than once, so what
> >> is a duplicate?
> >>
> >> Best
> >> Erick
> >>
> >> On Mon, Jul 21, 2008 at 9:40 AM, Sebastin <sebasmtech@gmail.com> wrote:
> >>
> >>
> >>
> >>> at the time search , while querying the data
> >>> markrmiller wrote:
> >>>
> >>>
> >>>> Sebastin wrote:
> >>>>
> >>>>
> >>>>> Hi All,
> >>>>>
> >>>>> Is there any possibility to avoid duplicate records in lucene  2.3.1?
> >>>>>
> >>>>>
> >>>>>
> >>>> I don't believe that there is a very high performance way to do this.
> >>>> You are basically going to have to query the index for an id before
> >>>> adding a new doc. The best way I can think of off the top of my head
> is
> >>>> to batch - first check that ids in the batch are unique, then check
> all
> >>>> ids in the batch against the IndexReader, then add the ones that are
> not
> >>>> dupes. Of course all of your docs would have to be added through this
> >>>> single choke point so that you knew other threads had not added that
> id
> >>>> after the first thread had looked but before it added the doc.
> >>>>
> >>>> I think Mark H has you covered if getting the dupes out after are
> okay.
> >>>>
> >>>> - Mark
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >>>> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>> --
> >>> View this message in context:
> >>>
> >>>
> http://www.nabble.com/How-to-avoid-duplicate-records-in-lucene-tp18543588p18568862.html
> >>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> >>>
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >>> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>>
> >>>
> >>>
> >>>
> >>
> >>
>  ------------------------------------------------------------------------
> >>
> >> No virus found in this incoming message.
> >> Checked by AVG. Version: 7.5.526 / Virus Database: 270.5.3/1563 -
> Release
> >> Date: 20/07/2008 12:59
> >>
> >>
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>
>
>
>       __________________________________________________________
> Not happy with your email address?.
> Get the one you really want - millions of new email addresses available now
> at Yahoo! http://uk.docs.yahoo.com/ymail/new.html
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message