Date: Wed, 20 May 2009 11:40:21 +0300
Subject: Re: Getting a score of a specific document
From: liat oren
To: java-user@lucene.apache.org
Reply-To: java-user@lucene.apache.org
In-Reply-To: <359a92830905180607k584bfc48p420772fbb5754d3b@mail.gmail.com>
References: <359a92830905140535y3425c344jdfdd8306be5fd54c@mail.gmail.com>
 <359a92830905140734s71f7943fuc237b02da9995c01@mail.gmail.com>
 <359a92830905171400y10484681p2b2325d2c19522ed@mail.gmail.com>
 <359a92830905180607k584bfc48p420772fbb5754d3b@mail.gmail.com>
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=0015174bdc704f8283046a53fa03

--0015174bdc704f8283046a53fa03
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

Ok, I understand. I will use the HitCollector.
Thanks a lot for all the explanations!

Best,
Liat

2009/5/18 Erick Erickson

> As best I understand it, you DO NOT WANT A FILTER. Filters do not contribute
> to scoring, therefore do not rank your documents. If you use
> a filter, the most irrelevant document could be first. You want to use
> a HitCollector, see the link in my last e-mail. That link includes an
> example of using a bitset, which you can create pretty easily from your
> list of document IDs.
>
> Best
> Erick
>
> On Mon, May 18, 2009 at 2:55 AM, liat oren wrote:
>
> > Sorry I didn't explain myself well.
> >
> > The problem I try to address is the following:
> > Think about the case where you have 100,000 documents indexed. Take word
> > 'a' - if it appears in 80,000 documents, you want the score to take it into
> > account. You want only to see how 20,000 documents are close to a query,
> > and only 10,000 of these contain the word 'a'.
> > 80,000 / 100,000 (the 'statistics' of the whole index) is noticeably larger
> > than 10,000 / 20,000 (the 'statistics' of only the group of documents). So
> > it does affect the score if I use the whole index or just the documents I
> > am interested in.
> > It might be that the order of these desired documents will not change, but
> > I don't see how you can assure it since the idf value can be really
> > different.
> >
> > So, I want the documents for my query to be
> > ranked *relative to each other*, AND NOT restricted to only the documents
> > I care about.
> > For that case, I need to use the filter, right?
> >
> > It's fine if I get the results in DocumentID order - then I open these
> > using IndexReader to get the fields I need.
> >
> > Could you please give me an example of how I create the Filter that filters
> > out a given list of ids?
> >
> > Thanks!
> > Liat
> >
> > 2009/5/18 Erick Erickson
> >
> > > I'm still unclear what you want the statistics *for*. "statistics"
> > > are pretty meaningless as far as I understand. The whole point
> > > of scoring is to use various "statistics" to *rank* documents *for
> > > a specific query*. You cannot, for instance, compare scores
> > > between different queries in any meaningful way.
> > >
> > > If you're saying that you want the documents for your query to be
> > > ranked *relative to each other*, but restricted to only the documents
> > > you care about, then I think you need a HitCollector
> > > because a Filter (last I knew) doesn't score documents therefore
> > > won't order them.
> > >
> > > But asking if the statistics reflect the whole index just isn't making
> > > any sense to me. If you're asking that question I suspect that there's
> > > something about your problem space I don't understand and
> > > you're not explaining simply enough for me to grasp.
> > >
> > > So forget a Filter because you'll get the documents back in
> > > (probably, but my memory is weak some days) document ID
> > > order. Implement a HitCollector whose collect method only
> > > sets bits for docs in your list. See:
> > >
> > > http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/search/HitCollector.html#collect(int,%20float)
> > >
> > > Best
> > > Erick
> > >
> > >
> > > On Sun, May 17, 2009 at 3:57 AM, liat oren wrote:
> > >
> > > > Yes, this is what I need - I don't need to get the scores for the
> > > > documents that were filtered.
> > > > The statistics I meant are idf(t) for example.
> > > > I want these to include the whole index of course.
> > > > It will include this info for the whole index, right?
> > > >
> > > > If I have a list of ids that the query should look at, which Filter
> > > > should I use?
> > > >
> > > > Thanks a lot,
> > > > Liat
> > > >
> > > > 2009/5/14 Erick Erickson
> > > >
> > > > > Hmmm, come to think of it, if you pass the Filter to the search I
> > > > > *think* you don't get scores for that clause, but you may want to
> > > > > check it out...
> > > > >
> > > > > So I think you should think about implementing a HitCollector
> > > > > and collect only the documents you care about.
> > > > >
> > > > > This is really very little extra work since all the documents have
> > > > > to be evaluated anyway.
> > > > >
> > > > > I'm not sure what you mean by statistics for the whole index. I suspect
> > > > > you're wondering if the scores reflect all the documents. But you don't
> > > > > care because scores are not relevant between different queries, and
> > > > > if they are calculated only within the query you're running, all the
> > > > > documents returned have scores that rank them relative to each other.
> > > > >
> > > > > Best
> > > > > Erick
> > > > >
> > > > > On Thu, May 14, 2009 at 9:16 AM, liat oren wrote:
> > > > >
> > > > > > Yes, I have a pre-defined list of documents that I care about.
> > > > > > Then I can do the search on these, but it will take the statistics
> > > > > > of the whole index, right?
> > > > > >
> > > > > > 2009/5/14 Erick Erickson
> > > > > >
> > > > > > > I don't know if I'm understanding what you want, but if you have a
> > > > > > > pre-defined list of documents, couldn't you form a Filter? Then
> > > > > > > your results would only be the documents you care about.
> > > > > > >
> > > > > > > If this is irrelevant, perhaps you could explain a bit more about
> > > > > > > the problem you're trying to solve.
> > > > > > >
> > > > > > > Best
> > > > > > > Erick
> > > > > > >
> > > > > > > On Thu, May 14, 2009 at 5:03 AM, liat oren <oren.liat@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > I have a big index and I want to get, for a specific search, only
> > > > > > > > the scores of a list of documents.
> > > > > > > > Is there a better way to get these scores than looping over the
> > > > > > > > whole result set?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Liat

--0015174bdc704f8283046a53fa03--
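
For reference, here is a minimal sketch of the approach Erick recommends in the thread: a collector whose collect method keeps a score only when the document ID is in a bitset built from the pre-defined list. Everything in it is illustrative rather than taken from the thread. The nested HitCollector class is a local stand-in that mirrors the collect(int, float) signature from the linked Lucene 2.2 Javadoc so that the example compiles without the Lucene jar, and the names AllowedDocsCollector, ScoredDoc, toBitSet and rankedHits, along with the sample scores, are hypothetical.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Sketch: keep scores only for a pre-defined list of document IDs. */
class RestrictedScoringDemo {

    /**
     * Local stand-in (assumption) for org.apache.lucene.search.HitCollector,
     * mirroring its collect(int doc, float score) callback.
     */
    abstract static class HitCollector {
        public abstract void collect(int doc, float score);
    }

    /** A document ID paired with the score reported for it. */
    static final class ScoredDoc {
        final int doc;
        final float score;

        ScoredDoc(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    /** Records a hit only when the document's bit is set in the allowed set. */
    static final class AllowedDocsCollector extends HitCollector {
        private final BitSet allowed;
        private final List<ScoredDoc> hits = new ArrayList<>();

        AllowedDocsCollector(BitSet allowed) {
            this.allowed = allowed;
        }

        @Override
        public void collect(int doc, float score) {
            if (allowed.get(doc)) {
                hits.add(new ScoredDoc(doc, score));
            }
        }

        /** Returns the collected hits, highest score first. */
        List<ScoredDoc> rankedHits() {
            List<ScoredDoc> sorted = new ArrayList<>(hits);
            sorted.sort((a, b) -> Float.compare(b.score, a.score));
            return sorted;
        }
    }

    /** Builds the bitset of allowed documents from a list of document IDs. */
    static BitSet toBitSet(int[] docIds) {
        BitSet bits = new BitSet();
        for (int id : docIds) {
            bits.set(id);
        }
        return bits;
    }

    public static void main(String[] args) {
        AllowedDocsCollector collector =
                new AllowedDocsCollector(toBitSet(new int[] {2, 5, 9}));

        // Simulated callbacks with made-up scores. With Lucene on the
        // classpath, the searcher would invoke collect() for each match.
        collector.collect(1, 0.90f);
        collector.collect(2, 0.30f);
        collector.collect(5, 0.70f);
        collector.collect(7, 0.50f);

        for (ScoredDoc hit : collector.rankedHits()) {
            System.out.println("doc=" + hit.doc + " score=" + hit.score);
        }
    }
}
```

With the real library, the collector would extend Lucene's own HitCollector and be passed to the searcher's search(Query, HitCollector) method in place of the simulated calls. The collector only decides which hits to keep; the searcher still computes each score as usual, so statistics such as idf(t) come from the whole index, and sorting the kept hits ranks the listed documents relative to each other.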