lucene-java-commits mailing list archives

From Apache Wiki <>
Subject [Lucene-java Wiki] Update of "ScoresAsPercentages" by HossMan
Date Tue, 30 Jun 2009 19:36:32 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-java Wiki" for change notification.

The following page has been changed by HossMan:

  scores from other queries.
+ In some other threads, the topic of computing a percentage from the maximum *possible* score
has been discussed, but an approach like this would pose additional problems...
+ {{{
+ : Is it possible to compute a theoretical maximum score for a given query if
+ : constraints are placed on 'tf' and 'lengthNorm'? If so, scores could be
+ : compared to a 'perfect score' (a feature request from our customers)
+ without thinking about it too hard, you'd also need to constrain:
+  * field and document boosts (which are combined with lengthNorm to create
+    the fieldNorm)
+  * query time boosts
+  * idf (there's no law that says Similarity.idf has to return a number
+    less than 1)
+  * queryNorm
+ you'd also need to use query structures that are simple -- a
+ ValueSourceQuery can break all the rules it wants -- but if you used only
+ basic types of queries (boolean, term, phrase) and you imposed those
+ constraints you could probably pull it off.
+ the key thing to watch out for is that even if you can do it, and you can
+ start to say meaningful things like "doc A scored X out of a max possible Y
+ against query Q" that doesn't necessarily help you compare that with doc
+ B which scored V out of a max possible W against query R ... even if X/Y == V/W
+ ... because the *structure* and complexity of Q and R play havoc
+ with the scores.
+ but comparisons like that are what people are going to start to do as
+ soon as you give them a number like that.  People are going to start to
+ think "well doc A is an N% match for Q, and doc B is an N% match for R,
+ but A is clearly a better match for Q than B is for R .. what the heck?"
+ ...i would do a lot of "subjective" testing using a variety of queries of
+ various complexities before i put any faith in producing a number like
+ that.
+ I suspect what you'd find is that the constraints you need to impose in
+ order to give you a meaningful number wind up hobbling Lucene so much it
+ doesn't do a very good job of scoring anything. 
+ }}}
+ As alluded to in that email, this still wouldn't make the scores comparable between queries
with different structures. It would also suffer from the same problem of percentages changing
when documents are added to or removed from the index (because of idf changes), even if those documents
have nothing to do with the query.
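As a rough illustration of the idf point, the sketch below assumes the formula that Lucene's DefaultSimilarity is generally documented as using, idf = 1 + ln(numDocs / (docFreq + 1)). The function name `classic_idf` and the document counts are made up for the example and are not part of Lucene. It shows only that the idf of a term (and therefore any percentage built on it) moves when unrelated documents are added, and that idf is not bounded by 1.

```python
import math


def classic_idf(doc_freq: int, num_docs: int) -> float:
    """Assumed DefaultSimilarity-style idf: 1 + ln(numDocs / (docFreq + 1)).

    doc_freq: number of documents containing the term
    num_docs: total number of documents in the index
    """
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical index: the query term appears in 9 documents out of 1000.
idf_before = classic_idf(doc_freq=9, num_docs=1000)

# Add 1000 documents that do not contain the term at all.
# doc_freq is unchanged, yet the idf (and any score built on it) shifts.
idf_after = classic_idf(doc_freq=9, num_docs=2000)

print(f"idf before adding unrelated docs: {idf_before:.4f}")
print(f"idf after adding unrelated docs:  {idf_after:.4f}")
```

Both values come out well above 1, and the second is larger than the first, so a "percentage of maximum" computed on Monday would not match one computed on Tuesday for the same document and query if the index grew in between.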
  More background...
-  *
