Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: neutral (asf.osuosl.org: local policy)
Date: Thu, 4 May 2006 14:44:40 -0700 (PDT)
From: Chris Hostetter <hossman_lucene@fucit.org>
To: java-user@lucene.apache.org
Subject: Re: Newbie questions re: scoring
In-Reply-To: 
 <D3FB08E688E4954EADE12CCFFA2F2548D65FA2@canat0411.ca.deloitte.com>
Message-ID: <Pine.LNX.4.58.0605041434570.17134@hal.rescomp.berkeley.edu>
References: <D3FB08E688E4954EADE12CCFFA2F2548D65FA2@canat0411.ca.deloitte.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII


: 1)  I create an index with one document with a searchable field of "All
: dogs are brown."  If I search on that index with a query of "All dogs
: are brown." I do not get a hit with score 1.0, but something low like
: 0.38.  I tried looking at the scoring algorithm and can't make heads or
: tails of it.  Can anybody explain it to me in simple terms?

I've been using Lucene for about 16 months now, and I've never found a
simple way to explain the scoring.  But a big factor that you need to
realize is that there is a difference between the "raw" score and the
normalized score.  If you use a HitCollector or TopDocs object you get the
raw scores -- which are unconstrained.  If you use a Hits object then your
scores will be normalized so that *if* the highest scoring document has a
score above 1, then all scores will be divided by the highest score -- if
the highest score is less than one, nothing changes.  That is why a lone
exact match can still come back as 0.38: nothing ever scales the top hit
*up* to 1.0.
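
To make the normalization rule above concrete, here is a minimal sketch in
plain Java.  It does not call Lucene at all: the class name
HitsNormalizationSketch and the method normalize() are hypothetical names
made up for illustration, and the code only mimics the rule as described
(divide by the top score when it is above 1, otherwise leave the scores
alone).  It is not the actual Hits implementation.

```java
public class HitsNormalizationSketch {

    // Hypothetical helper (not a Lucene API): applies the rule described
    // above to an array of raw scores and returns the normalized copy.
    static float[] normalize(float[] rawScores) {
        // Find the highest raw score.
        float max = 0.0f;
        for (float score : rawScores) {
            max = Math.max(max, score);
        }

        float[] normalized = new float[rawScores.length];
        for (int i = 0; i < rawScores.length; i++) {
            // Only scale down when the top score exceeds 1; scores are
            // never scaled up.
            normalized[i] = (max > 1.0f) ? rawScores[i] / max : rawScores[i];
        }
        return normalized;
    }

    public static void main(String[] args) {
        // Top raw score is below 1: the scores pass through unchanged.
        float[] low = normalize(new float[] {0.38f});
        // Top raw score is above 1: every score is divided by the top score.
        float[] high = normalize(new float[] {2.4f, 1.2f});

        System.out.println("low[0]  = " + low[0]);
        System.out.println("high[0] = " + high[0]);
        System.out.println("high[1] = " + high[1]);
    }
}
```

Under these assumptions a one-document result with a raw score of 0.38 stays
at 0.38, while a result list whose best raw score is 2.4 gets rescaled so the
best hit reports 1.0.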

My best advice for understanding how scores are calculated is to look
at the toString() of an Explanation object from searcher.explain() for a
bunch of queries on a bunch of documents you know match, and think about
how those explanations correspond to the equation in the Similarity class
javadocs.

: 2)  I have an index of documents, then run a search against it.  I run
: through the list of hits, building a Vector of documents whose score is
: above a certain threshold.  If I run the program with a threshold of
: say, 0.15, I'll get a Vector of documents with scores >= 0.15 (as
: expected).  If I set the threshold higher (0.30, for example) and rerun
: the program, I see some of the same documents that I thought would have
: been trimmed off with the higher threshold.  With a threshold of 0.15
: they would score 0.17, and with a threshold of 0.30 they are scoring
: something like 0.33.  Can anybody explain this?  My trimming is coming
: post-index-searching, so this is pretty confusing.

Are you doing this with the exact same index and Query each time?

1) That shouldn't happen ... can you email some code that demonstrates this
problem?  (Ideally code that builds a small index and then searches it and
shows the same document getting two different scores without the index
changing.)

2) Independent of the scores being different, it is not safe to try to
pick a score threshold; this is mentioned in the FAQ...

http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-912c1f237bb00259185353182948e5935f0c2f03


-Hoss

