Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Message-ID: <e994873a0703291709r63cb0ba6ufc4f5c78c2eaf2da@mail.gmail.com>
Date: Thu, 29 Mar 2007 20:09:56 -0400
From: "Peter Keegan" <peterlkeegan@gmail.com>
To: java-user@lucene.apache.org
Subject: Re: FieldSortedHitQueue enhancement
In-Reply-To: <460C3890.7010004@teamware.com>
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----=_Part_51513_4507029.1175213396932"
References: <20070329144800.4892.qmail@web50312.mail.re2.yahoo.com>
	 <e994873a0703290800o7fb39256l3325d2e8f605404f@mail.gmail.com>
	 <460C3890.7010004@teamware.com>

------=_Part_51513_4507029.1175213396932
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

> Peter, how did you achieve 'last wins' as you must presumably remove first
> from the PQ?

I implemented 'first wins' because the score is less important than other
fields (distance, in our case), but you make a good point since score may be
more important. How did you implement remove()?
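
A minimal sketch of the 'first wins' check follows. It is not the production code and makes some assumptions: it uses java.util.PriorityQueue as a stand-in for Lucene's PriorityQueue, treats a higher sort value as better, and the names FirstWinsTopN, Hit and wouldBeInserted are hypothetical. The set of doc IDs is consulted only for hits that would actually enter the queue, so it never grows beyond the queue size.

```java
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical bounded top-N queue with 'first wins' de-duplication; not Lucene code. */
final class FirstWinsTopN {
    /** Stand-in for a hit: a doc ID plus one sort value (higher is better). */
    static final class Hit {
        final int doc;
        final float value;
        Hit(int doc, float value) { this.doc = doc; this.value = value; }
    }

    private final int maxSize;
    // Min-heap: the least competitive hit is at the head, playing the role of top().
    private final PriorityQueue<Hit> queue =
        new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.value));
    // Doc IDs currently in the queue; never larger than maxSize.
    private final Set<Integer> queuedDocs = new TreeSet<>();

    FirstWinsTopN(int maxSize) { this.maxSize = maxSize; }

    /** Returns true if the hit would enter the queue, without inserting it. */
    boolean wouldBeInserted(Hit hit) {
        if (queue.size() < maxSize) return true;
        return !queue.isEmpty() && !(hit.value < queue.peek().value);
    }

    /** Inserts the hit unless it is not competitive or its doc is already queued. */
    boolean insert(Hit hit) {
        if (!wouldBeInserted(hit)) return false;         // dropped: no duplicate check needed
        if (queuedDocs.contains(hit.doc)) return false;  // duplicate: the first hit wins
        if (queue.size() == maxSize) {
            Hit evicted = queue.poll();                  // make room by evicting the minimum
            queuedDocs.remove(evicted.doc);
        }
        queue.add(hit);
        queuedDocs.add(hit.doc);
        return true;
    }

    int size() { return queue.size(); }
}
```

With a queue of size 2, inserting doc 1 twice keeps only the first hit, and a hit below the current minimum is rejected before the set is ever consulted.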

Peter
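
For the remove()/replace() case described in the quoted message below (duplicates keyed by an external ID, higher score wins), one possible shape is sketched here. It is not Antony's implementation. It assumes java.util.PriorityQueue instead of Lucene's PriorityQueue, and BestScoreTopN, Entry and scoreOf are hypothetical names. A map from external ID to the queued entry makes duplicate detection cheap; taking the old entry out of the heap still uses the linear-time remove(Object).

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical score-ordered top-N queue keyed by an external ID; not Lucene code. */
final class BestScoreTopN {
    static final class Entry {
        final String externalId;
        final float score;
        Entry(String externalId, float score) { this.externalId = externalId; this.score = score; }
    }

    private final int maxSize;
    // Min-heap on score: the head is the lowest-scoring entry currently kept.
    private final PriorityQueue<Entry> queue =
        new PriorityQueue<>(Comparator.comparingDouble((Entry e) -> e.score));
    // External ID -> entry currently in the queue.
    private final Map<String, Entry> byId = new HashMap<>();

    BestScoreTopN(int maxSize) { this.maxSize = maxSize; }

    /** Returns true if the queue changed. */
    boolean insert(String externalId, float score) {
        Entry existing = byId.get(externalId);
        if (existing != null) {
            if (score <= existing.score) return false;  // keep the higher stored score
            queue.remove(existing);                     // linear scan of the heap
        } else if (queue.size() >= maxSize) {
            if (queue.isEmpty() || score < queue.peek().score) return false; // not competitive
            byId.remove(queue.poll().externalId);       // evict the current minimum
        }
        Entry entry = new Entry(externalId, score);
        queue.add(entry);
        byId.put(externalId, entry);
        return true;
    }

    /** Stored score for an ID, or null if the ID is not in the queue. */
    Float scoreOf(String externalId) {
        Entry e = byId.get(externalId);
        return e == null ? null : e.score;
    }

    int size() { return queue.size(); }
}
```

The linear remove only runs when a duplicate arrives with a higher score, so for queues of a few hundred entries the scan is likely to be a minor cost.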


On 3/29/07, Antony Bowesman <adb@teamware.com> wrote:
>
> I've got a similar duplicate case, but my duplicates are based on an
> external ID rather than the doc ID, so they occur within a single Query.
> It's using a custom HitCollector that is score based, not field sorted.
>
> If my duplicate has a higher score than the one on the PQ, I need to
> update the stored score with the higher one, so the PQ needs a replace()
> method where the stored object's equals() can be used to find the object
> to delete.  I'm not sure if there's a way to find the object efficiently
> in this case other than a linear search.  I implemented remove().
>
> Peter, how did you achieve 'last wins' as you must presumably remove first
> from the PQ?
>
> Antony
>
>
> Peter Keegan wrote:
> > The duplicate check would just be on the doc ID. I'm using a TreeSet to
> > detect duplicates with no noticeable effect on performance. The PQ only
> > has to be checked for a previous value IFF the element about to be
> > inserted is actually inserted and not dropped because it's less than the
> > least value already in there. So, the TreeSet is never bigger than the
> > size of the PQ (typically 25 to a few hundred items), not the size of
> > all hits.
> >
> > Peter
> >
> > On 3/29/07, Otis Gospodnetic <otis_gospodnetic@yahoo.com> wrote:
> >>
> >> Hm, removing duplicates (as determined by a value of a specified
> >> document field) from the results would be nice.
> >> How would your addition affect performance, considering it has to check
> >> the PQ for a previous value for every candidate hit?
> >>
> >> Otis
> >> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
> >> Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
> >>
> >> ----- Original Message ----
> >> From: Peter Keegan <peterlkeegan@gmail.com>
> >> To: java-user@lucene.apache.org
> >> Sent: Thursday, March 29, 2007 9:39:13 AM
> >> Subject: FieldSortedHitQueue enhancement
> >>
> >> This is a request for an enhancement to
> >> FieldSortedHitQueue/PriorityQueue that would prevent duplicate
> >> documents from being inserted or, alternatively, allow the application
> >> to prevent this (reason explained below). I can do this today by making
> >> the 'lessThan' method public and checking the queue before inserting,
> >> like this:
> >>
> >> if (hq.size() < maxSize) {
> >>   // doc will be inserted into queue - check for duplicate before inserting
> >> } else if (hq.size() > 0
> >>     && !hq.lessThan((ScoreDoc) fieldDoc, (ScoreDoc) hq.top())) {
> >>   // doc will be inserted into queue - check for duplicate before inserting
> >> } else {
> >>   // doc will not be inserted - no check needed
> >> }
> >>
> >> However, this just replicates existing code in PriorityQueue.insert().
> >> An alternative would be to have a method like:
> >>
> >> public boolean wouldBeInserted(ScoreDoc doc)
> >> // returns true if doc would be inserted, without inserting
> >>
> >> The reason for this is that I have some queries that get expanded into
> >> multiple searches and the resulting hits are OR'd together. The queries
> >> contain 'terms' that are not seen by Lucene but are handled by a
> >> HitCollector that uses external data for each document to evaluate
> >> hits. The results from the priority queue should contain no duplicate
> >> documents (first or last doc wins).
> >>
> >> Do any of these suggestions seem reasonable? So far, I've been able to
> >> use Lucene without any modifications, and hope to continue this way.
> >>
> >> Peter
> >>
> >>
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
> >

------=_Part_51513_4507029.1175213396932--