lucene-java-user mailing list archives

From jlman <>
Subject Re: How to get the score in percentage
Date Sat, 22 Aug 2009 13:40:11 GMT

hossman wrote:
> : here ie, in our existing system we are showing the search score in
> : percentage but lucene provides the search score in numbers which is
> : derived from some internal logic. Can anybody give some tips for
> : converting the lucene score to percentage or is there any way to
> : retrieve the score as percentage from lucene search.
>
> there is an extremely important and fundamental question you have to
> answer when you say you want "the score as a percentage" ...
>
> 	A percentage of what exactly?
>
> score values are meaningful only for purposes of comparison between other
> documents for the exact same query and the exact same index.  when you try
> to compute a percentage, you are setting up an implicit comparison with
> scores from other queries.
>
> -Hoss

There is one situation where comparison is viable: when the input is an
existing document (i.e., using the mlt function or running a simple query
built from a document's title/body). In such cases, the score of the
document against itself (which will hopefully be the max score in the
result set) is the scaling factor. With this approach we can answer the
question "are docs A and B more similar than docs C and D?"
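
A minimal sketch of that scaling, in plain Java with no Lucene dependency.
The class and method names (SelfScoreScaling, scale) are hypothetical, and
the raw scores are assumed to come from one query against one index, with
the source document's own hit supplying selfScore. Since the self-score is
only hopefully the maximum, the ratio is clamped to [0, 1].

```java
final class SelfScoreScaling {
    private SelfScoreScaling() {}

    /**
     * Scales a raw score by the score the source document received
     * when matched against itself (hypothetical helper, not a Lucene API).
     *
     * @param rawScore  score of a candidate document for the query
     * @param selfScore score of the source document for the same query
     * @return the ratio rawScore / selfScore, clamped to [0, 1]
     */
    static double scale(double rawScore, double selfScore) {
        if (!(selfScore > 0.0)) {
            throw new IllegalArgumentException("selfScore must be positive");
        }
        double ratio = rawScore / selfScore;
        // The self-match is not guaranteed to be the top hit, so clamp.
        return Math.max(0.0, Math.min(1.0, ratio));
    }
}
```

The result is only comparable across queries to the extent that each
query's self-score is a fair stand-in for "a perfect match".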

This may even be the approach used by carrot for clustering, though I
haven't looked into how it generates its similarity matrix. (Note: it's
also possible that the scores between two docs aren't symmetric, meaning A
can be more similar to B than B is to A.)

Perhaps treating each query as a document would allow Lucene to return the
max score possible for that query (the match to itself), and then scale
documents from there. Yes, there are lots of challenges to actually doing
this, since you wouldn't want to add a temporary doc to the index.

I know this topic usually morphs into a debate over whether a
percentage-match is useful. While I agree that scaled/normalized scores are
prone to misuse, we need a way to know if there are any good results, not
just what the best results are. One use case is when users submit content
similar to existing content and you'd like to alert them to the
near-duplicate before proceeding. Obviously you only want to prompt them if
there are close matches, and currently Lucene only offers a way to get the
most similar docs, not a way to determine if any are actually similar.
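
For the near-duplicate case, a rough sketch of the check I have in mind,
again in plain Java. The names (NearDuplicateCheck, nearDuplicates) and the
threshold value are hypothetical, and it assumes a self-score for the
submitted content is already available, which is the hard part discussed
above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class NearDuplicateCheck {
    private NearDuplicateCheck() {}

    /**
     * Returns the ids of hits whose score, relative to the self-score,
     * reaches the threshold (hypothetical helper, not a Lucene API).
     *
     * @param rawScores doc id to raw score, all from the same query and index
     * @param selfScore score the submitted content would get against itself
     * @param threshold minimum scaled score to count as a near-duplicate
     */
    static List<String> nearDuplicates(Map<String, Double> rawScores,
                                       double selfScore,
                                       double threshold) {
        if (!(selfScore > 0.0)) {
            throw new IllegalArgumentException("selfScore must be positive");
        }
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Double> hit : rawScores.entrySet()) {
            // Keep only hits that are close to the self-match.
            if (hit.getValue() / selfScore >= threshold) {
                matches.add(hit.getKey());
            }
        }
        return matches;
    }
}
```

An empty list means "nothing close enough to prompt the user", which is the
answer a plain top-N result list can't give. The threshold would have to be
tuned per index; I don't have a value to recommend.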