From: "Peter Keegan"
To: java-user@lucene.apache.org
Date: Thu, 29 Mar 2007 20:09:56 -0400
Subject: Re: FieldSortedHitQueue enhancement

> Peter, how did you achieve 'last wins' as you must presumably remove first
> from the PQ?

I implemented 'first wins' because the score is less important than other
fields (distance, in our case), but you make a good point since score may be
more important. How did you implement remove()?

Peter

On 3/29/07, Antony Bowesman wrote:
>
> I've got a similar duplicate case, but my duplicates are based on an
> external ID rather than Doc id so occurs for a single Query. It's using a
> custom HitCollector but score based, not field sorted.
>
> If my duplicate contains a higher score than one on the PQ I need to update
> the stored score with the higher one, so PQ needs a replace() method where
> the stored object.equals() can be used to find the object to delete. I'm
> not sure if there's a way to find the object efficiently in this case other
> than a linear search. I implemented remove().
>
> Peter, how did you achieve 'last wins' as you must presumably remove first
> from the PQ?
>
> Antony
>
>
> Peter Keegan wrote:
> > The duplicate check would just be on the doc ID. I'm using TreeSet to
> > detect duplicates with no noticeable effect on performance. The PQ only
> > has to be checked for a previous value IFF the element about to be
> > inserted is actually inserted and not dropped because it's less than the
> > least value already in there. So, the TreeSet is never bigger than the
> > size of the PQ (typically 25 to a few hundred items), not the size of all
> > hits.
> >
> > Peter
> >
> > On 3/29/07, Otis Gospodnetic wrote:
> >>
> >> Hm, removing duplicates (as determined by a value of a specified document
> >> field) from the results would be nice.
> >> How would your addition affect performance, considering it has to check
> >> the PQ for a previous value for every candidate hit?
> >>
> >> Otis
> >> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
> >> Simpy -- http://www.simpy.com/ - Tag - Search - Share
> >>
> >> ----- Original Message ----
> >> From: Peter Keegan
> >> To: java-user@lucene.apache.org
> >> Sent: Thursday, March 29, 2007 9:39:13 AM
> >> Subject: FieldSortedHitQueue enhancement
> >>
> >> This is a request for an enhancement to FieldSortedHitQueue/PriorityQueue
> >> that would prevent duplicate documents from being inserted, or
> >> alternatively, allow the application to prevent this (reason explained
> >> below). I can do this today by making the 'lessThan' method public and
> >> checking the queue before inserting like this:
> >>
> >> if (hq.size() < maxSize) {
> >>     // doc will be inserted into queue - check for duplicate before inserting
> >> } else if (hq.size() > 0 && !hq.lessThan((ScoreDoc) fieldDoc, (ScoreDoc) hq.top())) {
> >>     // doc will be inserted into queue - check for duplicate before inserting
> >> } else {
> >>     // doc will not be inserted - no check needed
> >> }
> >>
> >> However, this is just replicating existing code in
> >> PriorityQueue->insert().
> >> An alternative would be to have a method like:
> >>
> >> public boolean wouldBeInserted(ScoreDoc doc)
> >> // returns true if doc would be inserted, without inserting
> >>
> >> The reason for this is that I have some queries that get expanded into
> >> multiple searches and the resulting hits are OR'd together. The queries
> >> contain 'terms' that are not seen by Lucene but are handled by a
> >> HitCollector that uses external data for each document to evaluate hits.
> >> The results from the priority queue should contain no duplicate documents
> >> (first or last doc wins).
> >>
> >> Do any of these suggestions seem reasonable? So far, I've been able to
> >> use Lucene without any modifications, and hope to continue this way.
> >>
> >> Peter
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
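
The check-before-insert idea discussed in this thread can be sketched roughly
as below. This is only an illustration under stated assumptions, not the code
anyone in the thread actually ran: Hit and DedupTopN are hypothetical names,
java.util.PriorityQueue stands in for Lucene's PriorityQueue and
FieldSortedHitQueue, ordering is by score only, and a HashSet is used where
the thread mentions a TreeSet. It implements 'first wins' on the doc ID and
only consults the duplicate set when the hit would actually enter the queue.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

/** Hypothetical stand-in for a scored hit; this is not Lucene's ScoreDoc. */
final class Hit {
    final int doc;
    final float score;

    Hit(int doc, float score) {
        this.doc = doc;
        this.score = score;
    }
}

/** Bounded top-N queue that rejects duplicate doc IDs ("first wins"). */
final class DedupTopN {
    private final int maxSize;
    // Min-heap on score: the head is the least competitive hit currently kept.
    private final PriorityQueue<Hit> queue;
    // Doc IDs currently held in the queue, so it stays about as small as the queue.
    private final Set<Integer> docsInQueue = new HashSet<>();

    DedupTopN(int maxSize) {
        if (maxSize < 1) {
            throw new IllegalArgumentException("maxSize must be at least 1");
        }
        this.maxSize = maxSize;
        this.queue = new PriorityQueue<>(maxSize, (a, b) -> Float.compare(a.score, b.score));
    }

    /** Returns true if the hit would enter the queue (duplicates aside), without inserting. */
    boolean wouldBeInserted(Hit hit) {
        return queue.size() < maxSize
                || Float.compare(hit.score, queue.peek().score) >= 0;
    }

    /** Inserts the hit unless it is uncompetitive or its doc ID is already queued. */
    boolean insert(Hit hit) {
        if (!wouldBeInserted(hit)) {
            return false; // would be dropped anyway, so no duplicate check is needed
        }
        if (!docsInQueue.add(hit.doc)) {
            return false; // duplicate doc ID: the earlier hit wins
        }
        if (queue.size() == maxSize) {
            // Evict the least hit and forget its doc ID.
            docsInQueue.remove(queue.poll().doc);
        }
        queue.add(hit);
        return true;
    }

    /** Returns the kept hits, best score first. */
    List<Hit> topHits() {
        List<Hit> out = new ArrayList<>(queue);
        out.sort((a, b) -> Float.compare(b.score, a.score));
        return out;
    }
}
```

A 'last wins' or 'higher score wins' variant would also need to remove the
earlier entry from the queue. With java.util.PriorityQueue that is
remove(Object), which is a linear search, in line with Antony's remark above.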