Mailing-List: contact lucene-dev-help@jakarta.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Date: Fri, 30 Apr 2004 18:15:59 -0400 (EDT)
From: "Matthew W. Bilotti" <mbilotti@csail.mit.edu>
To: lucene-dev@jakarta.apache.org
Subject: Help with scoring, coordination factor?
Message-ID: <Pine.LNX.4.44.0404301801570.2308-100000@bahamut.csail.mit.edu>
Organization: Massachusetts Institute of Technology
X-GPG-PUBLIC_KEY: http://web.mit.edu/mbilotti/www/mbilotti_public_key.asc
X-GPG-FINGERPRING: C566 09E5 1594 BB63 2732  DBAA 3C93 F73F 7B7E 403D
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII


> In my case it works perfectly. As we generate multilingual and semantic
> expansions of the original words of a query, the coordination factor was
> giving lower score to words with a lot of semantic or morphologic 
> variants.
> 

For me, this has not worked.  I have defined a WordQuery class and used it 
to define my disjunctions, but I am finding that the documents I am 
interested in are still suffering rank penalties.

I wanted to understand how the scoring works internally, so for each 
document in my Hits, I printed the score and an Explanation, querying 
on the original forms of each word only (no WordQueries used).

The first document returned had a score of 0.592 and an explanation of 
"0.0 = match required".  Can anyone tell me what this means?  The next 39 
documents retrieved have the same explanation, and steadily decreasing 
scores, which makes sense.  The 40th document retrieved, though, has a 
score of 1.0 and the explanation:

0.0 = fieldWeight(contents:invented in 0), product of:
  0.0 = tf(termFreq(contents:invented)=0)
  6.507968 = idf(docFreq=4189)
  0.0390625 = fieldNorm(field=contents, doc=0)

Can anyone help me understand why a document with score 1.0 is retrieved 
directly after a document with score 0.211?  I don't understand the 
explanation.  Why is the term frequency of "invented" 0?  It should be 3; 
I checked the document.  I tried to delve into the code to find out how to 
print all of the components of the score to the screen (especially coord, 
which I am interested in), but I couldn't figure out how to do it.
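
For reference, here is a minimal sketch of how the factors in the 
Explanation above would combine, assuming the default similarity is in 
use and that it computes tf as the square root of the term frequency and 
coord as the fraction of query clauses matched.  The class ScoreSketch 
and its methods are hypothetical names for illustration, not Lucene 
code; the idf and fieldNorm values are copied from the Explanation.

```java
public class ScoreSketch {
    // Term-frequency factor: square root of the raw count
    // (assumed to match the default similarity).
    static float tf(int termFreq) {
        return (float) Math.sqrt(termFreq);
    }

    // Coordination factor: fraction of the query's clauses that the
    // document matched (assumed to match the default similarity).
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    // fieldWeight as laid out in the Explanation: tf * idf * fieldNorm.
    static float fieldWeight(int termFreq, float idf, float fieldNorm) {
        return tf(termFreq) * idf * fieldNorm;
    }

    public static void main(String[] args) {
        float idf = 6.507968f;        // from the Explanation above
        float fieldNorm = 0.0390625f; // from the Explanation above

        // What the Explanation reports (termFreq = 0).
        System.out.println("termFreq=0: " + fieldWeight(0, idf, fieldNorm));
        // What I would expect if the three occurrences were counted.
        System.out.println("termFreq=3: " + fieldWeight(3, idf, fieldNorm));
        // Example coord for a document matching 2 of 3 query clauses.
        System.out.println("coord 2/3:  " + coord(2, 3));
    }
}
```

Under these assumptions a term frequency of 0 forces the whole 
fieldWeight to 0.0, which is what the Explanation shows, while a term 
frequency of 3 would give a nonzero weight.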

Any help or hints you can give me would be truly appreciated.

~ Matthew

-- 
matthew w. bilotti
computer science and artificial intelligence laboratory
massachusetts institute of technology

