Message-ID: <469F8361.3050007@gmail.com>
Date: Thu, 19 Jul 2007 11:29:37 -0400
From: Mark Miller
Reply-To: java-user@lucene.apache.org
To: java-user@lucene.apache.org
Subject: Re: distinct query how to???
References: <00f601c7ca0a$c018c140$0b0410ac@bhavin> <359a92830707190651p71b82349oae5e103f42ef1bf4@mail.gmail.com> <012101c7ca0f$56eea8b0$0b0410ac@bhavin>
In-Reply-To: <012101c7ca0f$56eea8b0$0b0410ac@bhavin>

You get non relevant results because normally a HitCollector will only
collect documents with scores greater than 0.

Hits normalizes raw scores like this:

    if (hitDocs.size() > min) {
      min = hitDocs.size();
    }

    int n = min * 2;    // double # retrieved
    TopDocs topDocs = (sort == null)
        ? searcher.search(weight, filter, n)
        : searcher.search(weight, filter, n, sort);
    length = topDocs.totalHits;
    ScoreDoc[] scoreDocs = topDocs.scoreDocs;

    float scoreNorm = 1.0f;

    if (length > 0 && topDocs.getMaxScore() > 1.0f) {
      scoreNorm = 1.0f / topDocs.getMaxScore();
    }

    int end = scoreDocs.length < length ? scoreDocs.length : length;
    for (int i = hitDocs.size(); i < end; i++) {
      hitDocs.addElement(new HitDoc(scoreDocs[i].score * scoreNorm,
                                    scoreDocs[i].doc));
    }

- Mark

Bhavin Pandya wrote:
> Hi erick,
> Thanks for your prompt reply...
>
> Let me explain what i m doing....
>
> There is lucene query which returns relevant result when i am
> searching through Hits object.
> But when i m using same query using DocCollector ( I want this way
> because want to remove duplicate records at search time )
> .. Its giving results which is not relevant although its printing
> score in descending order.
>
> Here is what i am doing in DocCollector...
>
> ////////////////////////////////////////////////////////////////////
>
> public void collect(int doc, float score)
> {
>
>     Document document = reader.document(doc);
>     String photoid = document.get("photoid");
>     if (!uniquelist.contains(photoid))
>     {
>         uniquelist.add(photoid);
>         hq.insert(new ScoreDoc(doc, score));
>         minScore = ((ScoreDoc)hq.top()).score; // maintain minScore
>     }
> }
>
> public TopDocs topDocs() {
>
>     ScoreDoc[] scoreDocs = new ScoreDoc[hq.size()];
>     for (int i = hq.size()-1; i >= 0; i--) // put docs in array
>         scoreDocs[i] = (ScoreDoc)hq.pop();
>
>     float maxScore = (totalHits==0)
>         ? Float.NEGATIVE_INFINITY
>         : scoreDocs[0].score;
>
>     return new TopDocs(totalHits, scoreDocs, maxScore);
> }
>
>
> public ArrayList getAllDocIds()
> {
>     ArrayList docidlist = new ArrayList();
>     ArrayList mainlist = new ArrayList();
>     TopDocs tc = topDocs();
>     ScoreDoc[] scoredoc = tc.scoreDocs;
>
>     for (int i=0;i<scoredoc.length;i++)
>     {
>         doclist.add(new Integer(scoredoc[i].doc).toString());
>     }
>     return doclist;
> }
> ////////////////////////////////////////////////////////////////////
>
>
> Is this a proper way to find duplicate records ??? If yes please let
> me know where i am wrong.. ???
> Note: In this case, i can not handle duplicate records at index time...
>
> Thanks.
> Bhavin pandya
>
>
>
>
> ----- Original Message -----
> From: "Erick Erickson"
> To: ; "Bhavin Pandya"
> Sent: Thursday, July 19, 2007 7:21 PM
> Subject: Re: Where exact score is getting calculate?
>
>
>> I don't think you can using a HitCollector. If you used a TopDocs
>> instead,
>> you have access to the maximum score and can normalize the
>> scores to between 0 and 1, but I don't know if that suits your needs.
>>
>> Erick
>>
>> On 7/19/07, Bhavin Pandya wrote:
>>>
>>> Hi,
>>>
>>> The score i am getting in DocCollector is raw score... which is not
>>> necessary between 0 and 1.
>>> Where lucene exactly calculating the final score...? Or
>>> what if i want final score in DocCollector ??? How to ???
>>>
>>> Regards.
>>> Bhavin pandya
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
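
The two points in Mark's reply (skip hits whose score is not above 0, and
scale scores by 1/maxScore only when maxScore is above 1.0) can be tried
out without Lucene on the classpath. What follows is only a rough
plain-Java sketch, not Lucene code: DistinctCollectorSketch, Scored and
the key parameter (standing in for the stored "photoid" value) are
made-up names for illustration, and java.util.PriorityQueue is used in
place of Lucene's own priority queue.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

/** Plain-Java stand-in for a de-duplicating collector; no Lucene classes are used. */
final class DistinctCollectorSketch {

    /** Hypothetical (doc id, score) pair, playing the role of a ScoreDoc. */
    static final class Scored {
        final int doc;
        final float score;
        Scored(int doc, float score) { this.doc = doc; this.score = score; }
    }

    private final int maxHits;
    private final Set<String> seenKeys = new HashSet<>();
    // Min-heap on score: the weakest kept hit sits on top and is evicted first.
    private final PriorityQueue<Scored> heap =
            new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));

    DistinctCollectorSketch(int maxHits) { this.maxHits = maxHits; }

    /** key is an assumption: it stands in for the document's "photoid" field. */
    void collect(int doc, float score, String key) {
        if (score <= 0.0f) return;               // ignore hits with no positive score
        if (!seenKeys.add(key)) return;          // ignore keys that were already seen
        heap.add(new Scored(doc, score));
        if (heap.size() > maxHits) heap.poll();  // drop the lowest-scoring kept hit
    }

    /** Kept hits, best first, scaled the way the quoted Hits code scales them. */
    List<Scored> normalizedTopDocs() {
        List<Scored> hits = new ArrayList<>(heap);
        hits.sort((a, b) -> Float.compare(b.score, a.score));
        float maxScore = hits.isEmpty() ? 0.0f : hits.get(0).score;
        // Only scale down; scores already at or below 1.0 are left untouched.
        float scoreNorm = maxScore > 1.0f ? 1.0f / maxScore : 1.0f;
        List<Scored> out = new ArrayList<>();
        for (Scored s : hits) {
            out.add(new Scored(s.doc, s.score * scoreNorm));
        }
        return out;
    }

    public static void main(String[] args) {
        DistinctCollectorSketch c = new DistinctCollectorSketch(10);
        c.collect(0, 4.0f, "p1");
        c.collect(1, 2.0f, "p2");
        c.collect(2, 3.0f, "p1"); // same key as doc 0, ignored
        c.collect(3, 0.0f, "p3"); // zero score, ignored
        for (Scored s : c.normalizedTopDocs()) {
            System.out.println(s.doc + " " + s.score);
        }
    }
}
```

One thing the sketch makes visible: the first document seen for a key
wins, whatever its score. If hits are collected in document order rather
than score order, the copy that is kept is not necessarily the
best-scoring one for that key.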