From: Antony Bowesman
Date: Fri, 30 Mar 2007 08:07:12 +1000
To: java-user@lucene.apache.org
Subject: Re: FieldSortedHitQueue enhancement

I've got a similar duplicate case, but my duplicates are based on an external
ID rather than the doc ID, so they can occur within a single Query. It's using
a custom HitCollector, but score based, not field sorted. If my duplicate has a
higher score than the one on the PQ, I need to update the stored score with the
higher one, so the PQ needs a replace() method where the stored object's
equals() can be used to find the object to delete. I'm not sure if there's a
way to find the object efficiently in this case other than a linear search. I
implemented remove().

Peter, how did you achieve 'last wins' as you must presumably remove first from
the PQ?

Antony

Peter Keegan wrote:
> The duplicate check would just be on the doc ID. I'm using TreeSet to detect
> duplicates with no noticeable effect on performance. The PQ only has to be
> checked for a previous value IFF the element about to be inserted is
> actually inserted and not dropped because it's less than the least value
> already in there. So, the TreeSet is never bigger than the size of the PQ
> (typically 25 to a few hundred items), not the size of all hits.
>
> Peter
>
> On 3/29/07, Otis Gospodnetic wrote:
>>
>> Hm, removing duplicates (as determined by a value of a specified document
>> field) from the results would be nice.
>> How would your addition affect performance, considering it has to check
>> the PQ for a previous value for every candidate hit?
>>
>> Otis
>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>> Simpy -- http://www.simpy.com/ - Tag - Search - Share
>>
>> ----- Original Message ----
>> From: Peter Keegan
>> To: java-user@lucene.apache.org
>> Sent: Thursday, March 29, 2007 9:39:13 AM
>> Subject: FieldSortedHitQueue enhancement
>>
>> This is a request for an enhancement to FieldSortedHitQueue/PriorityQueue
>> that would prevent duplicate documents from being inserted, or
>> alternatively, allow the application to prevent this (reason explained
>> below). I can do this today by making the 'lessThan' method public and
>> checking the queue before inserting like this:
>>
>> if (hq.size() < maxSize) {
>>     // doc will be inserted into queue - check for duplicate before inserting
>> } else if (hq.size() > 0 && !hq.lessThan((ScoreDoc)fieldDoc, (ScoreDoc)hq.top())) {
>>     // doc will be inserted into queue - check for duplicate before inserting
>> } else {
>>     // doc will not be inserted - no check needed
>> }
>>
>> However, this is just replicating existing code in PriorityQueue->insert().
>> An alternative would be to have a method like:
>>
>> public boolean wouldBeInserted(ScoreDoc doc)
>> // returns true if doc would be inserted, without inserting
>>
>> The reason for this is that I have some queries that get expanded into
>> multiple searches and the resulting hits are OR'd together. The queries
>> contain 'terms' that are not seen by Lucene but are handled by a
>> HitCollector that uses external data for each document to evaluate hits.
>> The results from the priority queue should contain no duplicate documents
>> (first or last doc wins).
>>
>> Do any of these suggestions seem reasonable? So far, I've been able to use
>> Lucene without any modifications, and hope to continue this way.
>>
>> Peter

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
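
For readers of this thread, here is a rough sketch of the "replace on duplicate" idea described in the reply above: a bounded, score-ordered queue that keeps at most one entry per external ID and raises the stored score when a duplicate arrives with a higher one. It is only an illustration under stated assumptions. It uses java.util.PriorityQueue and a HashMap rather than Lucene's own PriorityQueue, and the names DedupTopN, Entry, offer and bestFirst are hypothetical, not part of any Lucene API. The map answers "is this ID already queued?" cheaply, but removing the old entry from the heap is still a linear scan, which matches the concern raised in the thread.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical helper: keeps the top-N scores, at most one entry per external ID. */
final class DedupTopN {

    /** One queued hit. The score is mutable so a duplicate can raise it. */
    static final class Entry {
        final String externalId;
        float score;

        Entry(String externalId, float score) {
            this.externalId = externalId;
            this.score = score;
        }
    }

    private final int maxSize;
    // Min-heap: the lowest-scoring entry sits at the head, ready to be evicted.
    private final PriorityQueue<Entry> heap =
            new PriorityQueue<>(Comparator.comparingDouble((Entry e) -> e.score));
    // Index of queued entries by external ID; never larger than the heap.
    private final Map<String, Entry> byId = new HashMap<>();

    DedupTopN(int maxSize) {
        if (maxSize < 1) {
            throw new IllegalArgumentException("maxSize must be >= 1");
        }
        this.maxSize = maxSize;
    }

    /** Offers a hit. Returns true if the queue contents changed. */
    boolean offer(String externalId, float score) {
        Entry existing = byId.get(externalId);
        if (existing != null) {
            if (score <= existing.score) {
                return false; // duplicate with no better score: ignore it
            }
            heap.remove(existing); // linear scan inside java.util.PriorityQueue
            existing.score = score;
            heap.add(existing);    // re-insert so the heap order is restored
            return true;
        }
        if (heap.size() < maxSize) {
            add(new Entry(externalId, score));
            return true;
        }
        if (score > heap.peek().score) {
            Entry evicted = heap.poll(); // drop the current lowest score
            byId.remove(evicted.externalId);
            add(new Entry(externalId, score));
            return true;
        }
        return false; // not competitive: no duplicate bookkeeping needed
    }

    private void add(Entry e) {
        heap.add(e);
        byId.put(e.externalId, e);
    }

    /** Returns a copy of the queued entries, highest score first. */
    List<Entry> bestFirst() {
        List<Entry> out = new ArrayList<>(heap);
        out.sort(Comparator.comparingDouble((Entry e) -> e.score).reversed());
        return out;
    }
}
```

A collector could call offer(id, score) for every hit and read bestFirst() once the search finishes. This variant is "highest score wins"; a "last wins" variant would replace the stored entry unconditionally instead of comparing scores.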