Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: pass (athena.apache.org: domain of oren.liat@gmail.com
 designates 209.85.219.179 as permitted sender)
DomainKey-Signature: a=rsa-sha1; c=nofws;
        d=gmail.com; s=gamma;
        h=mime-version:in-reply-to:references:date:message-id:subject:from:to
         :content-type;
        b=qMA1rHc33z+KEeWjeJ80q9k6+3i647MoMXCffwV30IlZFn+vhY5hTxxy30Piy83nck
         7ARNUxT82oZbWcw5xJWrx8ZMByvJi/3G4omRvRfs6NlVZ+o1x+pxX2Cr9FZXLYYYcA++
         gK5/nLvuWQ8/eH8HL6oIfOY/gsT6U12WMtMFM=
MIME-Version: 1.0
In-Reply-To: <359a92830905180607k584bfc48p420772fbb5754d3b@mail.gmail.com>
References: <d36c28c30905140203w356f875et30ee637cd0f99015@mail.gmail.com>
	 <359a92830905140535y3425c344jdfdd8306be5fd54c@mail.gmail.com>
	 <d36c28c30905140616w3d7dc48fv6c3846766571fa89@mail.gmail.com>
	 <359a92830905140734s71f7943fuc237b02da9995c01@mail.gmail.com>
	 <d36c28c30905170057l4cde1221p7492fb90f1932bc6@mail.gmail.com>
	 <359a92830905171400y10484681p2b2325d2c19522ed@mail.gmail.com>
	 <d36c28c30905172355h3dcd143bt714b2220ff2ce3bf@mail.gmail.com>
	 <359a92830905180607k584bfc48p420772fbb5754d3b@mail.gmail.com>
Date: Wed, 20 May 2009 11:40:21 +0300
Message-ID: <d36c28c30905200140t787369b5m8d58a7bfcc669567@mail.gmail.com>
Subject: Re: Getting a score of a specific document
From: liat oren <oren.liat@gmail.com>
To: java-user@lucene.apache.org
Content-Type: multipart/alternative; boundary=0015174bdc704f8283046a53fa03

--0015174bdc704f8283046a53fa03
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

Ok, I understand. I will use the HitColector.
Thanks a lot for all the explanations!
Best,
Liat

2009/5/18 Erick Erickson <erickerickson@gmail.com>

> As best I understand it, you DO NOT WANT A FILTER. Filters do notcontribute
> to scoring, therefore do not rank your documents. If you use
> a filter, the most irrelevant document could be first. You want to use
> a HitCollector, see the link in my last e-mail. That link includes an
> example of using a bitset, which you can create pretty easily from your
> list of document IDs.
>
> Best
> Erick
>
> On Mon, May 18, 2009 at 2:55 AM, liat oren <oren.liat@gmail.com> wrote:
>
> > Sorry I didn't explain myself well.
> >
> > The problem I try to address is the following:
> > Think about the case where you have 100,000 documents indexed. Take word
> > 'a'
> > - if it appears in 80,000 documents, you want the score to take it into
> > account. You want only to see how 20,000 documents are close to a query,
> > and
> > only 10,000 of these contain the word 'a'.
> > 80,000 / 100,000 (the 'statistics' of the whole index) is much smaller
> than
> > 10,000 / 20,000 (the 'statistics of only the group of documents). So it
> > does
> > affect the score if I use the whole index or just the documents I am
> > interested in.
> > It might be that the order of these desired documents will not change,
> but
> > I
> > don;t see how you can assure it since the idf value can be really
> > different.
> >
> > So, I want the documents for my query to be
> > ranked *relative to each other*, AND NOT restricted to only the documents
> > I care about.
> > For that case, I need to use the filter, right?
> >
> > Its fine if I get the results in DocumentID - then I open these using
> > IndexReader to get the fields I need.
> >
> > Could you please give me an example of how I creat the Filter that
> filters
> > out a given list of ids?
> >
> > Thanks!
> > Liat
> > 2009/5/18 Erick Erickson <erickerickson@gmail.com>
> >
> > > I'm still unclear what you want the statistics *for*. "statistics"
> > > are pretty meaningless as far as I understand. The whole point
> > > of scoring is to use various "statistics" to *rank* documents *for
> > > a specific query*. You cannot, for instance, compare scores
> > > between different queries in any meaningful way.
> > >
> > > If you're saying that you want the documents for your query to be
> > > ranked *relative to each other*, but restricted to only the documents
> > > you care about, then I think you need a HitCollector
> > > because a Filter (last I knew) doesn't score documents therefore
> > > won't order them.
> > >
> > > But asking if the statistics reflect the whole index just isn't making
> > any
> > > sense to me. If you're asking that question I suspect that there's
> > > something about your problem space I don't understand and
> > > you're not explaining simply enough for me to grasp <G>.
> > >
> > > So forget a Filter because you'll get the documents back in
> > > (probably, but my memory is weak some days) document ID
> > > order. Implement a HitCollector whose collect method only
> > > sets bits for docs in your list. See:
> > >
> > >
> >
> http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/search/HitCollector.html#collect(int,%20float)
> > >
> > > Best
> > > Erick
> > >
> > >
> > > On Sun, May 17, 2009 at 3:57 AM, liat oren <oren.liat@gmail.com>
> wrote:
> > >
> > > > Yes, this is what I need - I don't need to get the scores for the
> > > documents
> > > > that were filtered.
> > > > The statistics I ment are idf(t) for example.
> > > > I want these to include the whole index of course.
> > > > It will include this info of all the index, right?
> > > >
> > > > if I have a list of ids that the query should look at, which Filter
> > > should
> > > > I
> > > > use?
> > > >
> > > > Thanks a lot,
> > > > Liat
> > > >
> > > > 2009/5/14 Erick Erickson <erickerickson@gmail.com>
> > > >
> > > > > Hmmm, come to think of it, if you pass the Filter to the search
> > > I*think*
> > > > > you
> > > > > don't get scores for that clause, but you may want to
> > > > > check it out...
> > > > >
> > > > > So I think you should think about implementing a HitCollector
> > > > > and collect only the documents you care about.
> > > > >
> > > > > This is really very little extra work since all the documents have
> > > > > to be evaluated anyway.
> > > > >
> > > > > I'm not sure what you mean by statistics for the whole index. I
> > suspect
> > > > > you're wondering if the scores reflect all the documents. But you
> > don't
> > > > > care because scores are not relevant between different queries, and
> > > > > if they are calculated only within the query you're running, all
> the
> > > > > documents returned have scores that rank them relative to each
> other.
> > > > >
> > > > > Best
> > > > > Erick
> > > > >
> > > > > On Thu, May 14, 2009 at 9:16 AM, liat oren <oren.liat@gmail.com>
> > > wrote:
> > > > >
> > > > > > Yes, I have a pre-defined list of documents that I care about.
> > > > > > Then I can do the search on these, but it will take the
> statictics
> > of
> > > > the
> > > > > > whole index, right?
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > 2009/5/14 Erick Erickson <erickerickson@gmail.com>
> > > > > >
> > > > > > > I don't know if I'm understanding what you want, but if you
> havea
> > > > > > > pre-defined list of documents, couldn't you form a Filter? Then
> > > > > > > your results would only be the documents you care about.
> > > > > > >
> > > > > > > If this is irrelevant, perhaps you could explain a bit more
> about
> > > > > > > the problem you're trying to solve.
> > > > > > >
> > > > > > > Best
> > > > > > > Erick
> > > > > > >
> > > > > > > On Thu, May 14, 2009 at 5:03 AM, liat oren <
> oren.liat@gmail.com>
> > > > > wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > I have a big index and I want to get for a specific search
> only
> > > the
> > > > > > > grades
> > > > > > > > of a list of documents.
> > > > > > > > Is there a better way to get this score than looping on all
> the
> > > > > > reasults
> > > > > > > > set?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Liat
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

--0015174bdc704f8283046a53fa03--