lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chuck Williams <>
Subject Re: Implementation of a ScoreObject ?
Date Wed, 27 Apr 2005 16:30:20 GMT
Robichaud, Jean-Philippe wrote:

>Probably the simplest/ideal schema of the ScoreObject would be something
>like a hashtable with Term being the keys and a TermScoreObject the value.
>The TermScoreObject would be filled at search time (if asked) and would
>contain all values used in the calculation of the "similarity score".  That
>way we could easily know what is the contribution of a specific term to the
>overall score.  

Some of us have talked about a score object in the past and agree that 
this would be a very good thing.  In addition to providing a sounder 
foundation for explanation, such a mechanism could help to provide 
better scoring.  For example, one limitation in Lucene now is that score 
normalization is ad hoc -- all scores are divided by the highest score 
IF the highest score is greater than 1, and whether or not the highest 
unnormalized score is greater that 1 is pretty much random.  This yields 
a situation where scores across multiple searches are not comparable 
(notwithstanding many applications that do compare them, getting random 
results).  With a score object, one would like to keep additional 
information, e.g., a count of boost-weighted query terms and the 
boost-weighted percentage of such terms that were matched by each 
result.  This could provide a more intrinsic normalization scheme, e.g., 
defining the highest score as the boost-weighted percentage of matched 
query terms and dividing all scores by the same constant to achieve 
this.  (Some additional refinements are necessary to handle things like 
MultiTermQuery's, which rewrite to BooleanQuery's with coord disabled -- 
such lists of alternate query terms should count as one term).

That is one addition example of something score objects could be used 
for.  A general mechanism should provide for easy extension such that 
different scoring classes could collect, record and aggregate different 
information for various purposes.

I've wanted to work on this for a while but haven't found the time.  I 
know Doug has had a score object mechanism on his radar screen (he first 
suggested this approach to me as a solution to the normalization issue 
I'm concerned about).  I expect he has a good approach in mind.  It 
would be great if you'd tackle this -- I'd be happy to help if that 
makes sense.


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message