From: Antony Bowesman
Organization: Teamware Group
To: java-user@lucene.apache.org
Subject: Re: Combining score from two or more hits
Date: Fri, 23 Mar 2007 19:01:57 +1100
Message-ID: <46038975.8020204@teamware.com>

Chris Hostetter wrote:
> if you are using a HitCollector, then any re-evaluation is going to
> happen in your code using whatever mechanism you want -- once your collect
> method is called on a docid, Lucene is done with that docid and no longer
> cares about it ... it's only whatever storage you may be maintaining of
> high scoring docs that needs to know that you've decided the score has
> changed.
>
> your big problem is going to be that you basically need to maintain a list
> of *every* doc collected, if you don't know what the score of any of them
> is until you've processed all the rest ... since docs are collected in
> increasing order of docid, you might be able to make some optimizations
> based on how big of a gap you've got between the doc you are currently
> collecting and the last doc you've collected if you know that you're
> always going to add docs that "relate" to each other in sequential bundles
> -- but this would be some very custom code depending on your use case.

I only ever need to return a couple of ID fields per doc hit, so I load them
with FieldCache when I start a new searcher. These IDs refer to unique objects
elsewhere, but the same Id can appear in the index more than once because of
the way I've structured Documents: each Document corresponds to one attachment
in the other system, and an object in that system can have 1...n attachments.

My problem is that I need to return only unique external Ids, each with some
kind of combined score, up to the maxHits requested by the client.

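To make the requirement concrete, here is a minimal sketch of the "collect everything, rank at the end" variant. It is plain Java with no Lucene classes: CombinedScoreCollector, its collect/topHits methods and the externalIds array (standing in for the values loaded via FieldCache) are hypothetical names, and summing the scores is only an assumed combining rule.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: combines per-document scores by external Id, then ranks them. */
class CombinedScoreCollector {
    // docid -> external Id; assumed to be preloaded (e.g. from a field cache).
    private final String[] externalIds;
    // external Id -> combined score so far (assumption: scores are summed).
    private final Map<String, Float> combined = new HashMap<>();

    CombinedScoreCollector(String[] externalIds) {
        this.externalIds = externalIds;
    }

    /** Mirrors a collector-style callback: called once per matching docid. */
    void collect(int doc, float score) {
        combined.merge(externalIds[doc], score, Float::sum);
    }

    /** Returns at most maxHits external Ids, highest combined score first. */
    List<Map.Entry<String, Float>> topHits(int maxHits) {
        List<Map.Entry<String, Float>> hits = new ArrayList<>(combined.entrySet());
        hits.sort(Map.Entry.<String, Float>comparingByValue().reversed());
        return new ArrayList<>(hits.subList(0, Math.min(maxHits, hits.size())));
    }
}
```

In this sketch the map holds one entry per unique external Id rather than one per hit, and the ranking happens only after every hit has been collected. A different combining rule (for example, keeping the maximum) would only change the merge function.
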
Getting the unique Ids is no problem, but as you say, I either have to store
all hits and sort them by score at the end, once I know all the unique docs,
or do something clever with some type of PriorityQueue that lets me adjust
scores that are already in the sorted queue.

One idea your comments raise is the relationship between docids and the group
of Documents added for the higher-level object. All the Documents for an
external object are added with a single writer at index time. Assuming that
the Documents for a single external Id either all exist or none do, will the
doc ids for that external Id always stay sequential, or can they 'reorganise'
themselves?

Thanks
Antony