Date: Thu, 29 Mar 2007 19:10:26 -0700
From: "Tom Hill"
To: java-user@lucene.apache.org
Subject: Re: FieldSortedHitQueue enhancement
In-Reply-To:
<903039.7332.qm@web50307.mail.re2.yahoo.com>

On 3/29/07, Otis Gospodnetic wrote:
>
> Hm, removing duplicates (as determined by a value of a specified document
> field) from the results would be nice.
> How would your addition affect performance, considering it has to check
> the PQ for a previous value for every candidate hit?

We're doing this in Solr (slightly hacked up), based on a field. The penalty
was fairly substantial if you look only at the performance of the queue
itself, but the overall performance isn't bad (probably because Solr is fast
and caches nicely).

We just built a FieldSortedHitQueue equivalent, based on the standard
java.util.PriorityQueue instead of org.apache.lucene.util.PriorityQueue.

I'm sure it could be done more efficiently, but since performance has been
acceptable, I haven't bothered to try to improve it.

Tom

On 3/29/07, Otis Gospodnetic wrote:
>
> Ah, I see. This is less attractive to me personally, but maybe it helps
> others. One thing I don't understand is why/how you'd get duplicate
> documents with the same doc ID in there. Isn't insert(FieldDoc fdoc) called
> only once for each doc?
>
> Otis
> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
> Simpy -- http://www.simpy.com/ - Tag - Search - Share
>
> ----- Original Message ----
> From: Peter Keegan
> To: java-user@lucene.apache.org
> Sent: Thursday, March 29, 2007 11:00:24 AM
> Subject: Re: FieldSortedHitQueue enhancement
>
> The duplicate check would just be on the doc ID.
> I'm using a TreeSet to detect
> duplicates with no noticeable effect on performance. The PQ only has to be
> checked for a previous value if the element about to be inserted is
> actually inserted and not dropped because it's less than the least value
> already in there. So, the TreeSet is never bigger than the size of the PQ
> (typically 25 to a few hundred items), not the size of all hits.
>
> Peter
>
> On 3/29/07, Otis Gospodnetic wrote:
> >
> > Hm, removing duplicates (as determined by a value of a specified document
> > field) from the results would be nice.
> > How would your addition affect performance, considering it has to check
> > the PQ for a previous value for every candidate hit?
> >
> > Otis
> >
> > ----- Original Message ----
> > From: Peter Keegan
> > To: java-user@lucene.apache.org
> > Sent: Thursday, March 29, 2007 9:39:13 AM
> > Subject: FieldSortedHitQueue enhancement
> >
> > This is a request for an enhancement to FieldSortedHitQueue/PriorityQueue
> > that would prevent duplicate documents from being inserted or,
> > alternatively, allow the application to prevent this (reason explained
> > below). I can do this today by making the 'lessThan' method public and
> > checking the queue before inserting, like this:
> >
> > if (hq.size() < maxSize) {
> >     // doc will be inserted into queue - check for duplicate before inserting
> > } else if (hq.size() > 0 && !hq.lessThan((ScoreDoc)fieldDoc, (ScoreDoc)hq.top())) {
> >     // doc will be inserted into queue - check for duplicate before inserting
> > } else {
> >     // doc will not be inserted - no check needed
> > }
> >
> > However, this is just replicating existing code in
> > PriorityQueue->insert().
> > An alternative would be to have a method like:
> >
> > public boolean wouldBeInserted(ScoreDoc doc)
> > // returns true if doc would be inserted, without inserting
> >
> > The reason for this is that I have some queries that get expanded into
> > multiple searches and the resulting hits are OR'd together. The queries
> > contain 'terms' that are not seen by Lucene but are handled by a
> > HitCollector that uses external data for each document to evaluate hits.
> > The results from the priority queue should contain no duplicate documents
> > (first or last doc wins).
> >
> > Do any of these suggestions seem reasonable? So far, I've been able to
> > use Lucene without any modifications, and hope to continue this way.
> >
> > Peter
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
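
The approach discussed in this thread (a bounded queue built on java.util.PriorityQueue, with a doc-ID set that is consulted only when a hit would actually make it into the queue) could be sketched roughly as below. This is not the Lucene or Solr code: DedupTopN, Hit, insert, and drainBestFirst are hypothetical names, wouldBeInserted only mirrors the method proposed in the thread, hits are ranked by a single float score rather than by sort fields, a HashSet stands in for the TreeSet mentioned above, and "first doc wins" is assumed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

// Hypothetical stand-in for a bounded hit queue; not a Lucene or Solr class.
class DedupTopN {
    // Minimal hit record. Ranking by one float score is an assumption made
    // for brevity; FieldSortedHitQueue compares sort fields instead.
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    private final int maxSize;
    // Min-heap: the head is the weakest hit currently kept.
    private final PriorityQueue<Hit> heap =
            new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));
    // Doc IDs currently in the heap, so it never grows beyond maxSize.
    private final Set<Integer> docsInQueue = new HashSet<>();

    DedupTopN(int maxSize) { this.maxSize = maxSize; }

    // True if the hit would enter the queue, ignoring duplicates. Mirrors the
    // size/lessThan test quoted in the thread.
    boolean wouldBeInserted(Hit hit) {
        if (heap.size() < maxSize) return true;
        return !heap.isEmpty() && hit.score >= heap.peek().score;
    }

    // Returns true if the hit was added. First doc wins: a later hit for a
    // doc already in the queue is dropped.
    boolean insert(Hit hit) {
        if (!wouldBeInserted(hit)) return false;          // rejected without a set lookup
        if (docsInQueue.contains(hit.doc)) return false;  // duplicate doc ID
        if (heap.size() >= maxSize) {
            Hit evicted = heap.poll();                    // make room: drop the weakest hit
            docsInQueue.remove(Integer.valueOf(evicted.doc));
        }
        heap.add(hit);
        docsInQueue.add(hit.doc);
        return true;
    }

    // Empties the queue and returns its hits, best first.
    List<Hit> drainBestFirst() {
        List<Hit> out = new ArrayList<>(heap.size());
        while (!heap.isEmpty()) out.add(heap.poll());
        docsInQueue.clear();
        Collections.reverse(out);
        return out;
    }
}
```

Because the set tracks only documents currently in the queue, a document that was evicted earlier can re-enter later with a better score. That is what keeps the set no larger than the queue, and it still guarantees that the final results contain no duplicate doc IDs.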