lucene-java-user mailing list archives

From Nate <>
Subject Re: interpreting scores
Date Thu, 07 May 2009 06:28:36 GMT
Thanks Anshum.

What happens if a search returns only one match, and that match is not
very "good"? If scores are only comparable to the scores of other
matches in the same search, then the score is effectively meaningless
if there is only one match.

Providing a "relevance" metric along with search results seems like a
very common need. I somewhat understand the complexities after reading
this thread and the threads it links...
My case is slightly simpler, since I don't need to show users the
metric. My queries are simple term and boolean queries.

This thread talks about "theoretical maximum score" but quickly loses
me. Does this seem like the road to go down, given my needs?

Say I do a search like:
Michael Jackson Don't stop until you get enough
And this is the top match:
Michael Jackson Don't Stop 'till You Get Enough
Would it make any sense to do a query with the exact contents of the
top match to get a maximum score for that document? Would the
resulting percentage be meaningful?
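
To make the idea concrete, here is a minimal sketch in plain Java. It does not use Lucene at all: toyScore is a hypothetical stand-in that just counts shared tokens, so it says nothing about how Lucene's real scores would behave. It only shows the shape of the calculation, which is to score the document against the real query, score it again against its own text, and take the ratio.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Toy illustration only: this is NOT Lucene's scoring formula. */
final class SelfScoreRatio {

    /** Lowercases the text and splits on anything that is not a letter, digit, or apostrophe. */
    static Set<String> tokens(String text) {
        Set<String> out = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9']+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    /** Hypothetical stand-in for a search score: distinct query tokens found in the document. */
    static double toyScore(String query, String document) {
        Set<String> docTokens = tokens(document);
        int hits = 0;
        for (String t : tokens(query)) {
            if (docTokens.contains(t)) {
                hits++;
            }
        }
        return hits;
    }

    /** Score for the real query divided by the score the document gets for its own text. */
    static double ratio(String query, String document) {
        double self = toyScore(document, document);
        return self == 0.0 ? 0.0 : toyScore(query, document) / self;
    }

    public static void main(String[] args) {
        String doc = "Michael Jackson Don't Stop 'till You Get Enough";
        System.out.println(ratio("Michael Jackson Don't stop until you get enough", doc));
        System.out.println(ratio("Some Other Artist Unrelated Title", doc));
    }
}
```

With this toy score the first ratio is 0.875 (7 of the 8 document tokens match, since "until" and "'till" differ) and the second is 0.0. Whether the same ratio computed from real Lucene scores is stable enough to threshold on is exactly the open question.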


On Wed, May 6, 2009 at 10:08 PM, Anshum <> wrote:
> Hi Nate,
> The scores are only comparable within the same search and not across different
> searches, as the scores are affected by the query as well as the docs.
> About the threshold, I guess you could use a count cutoff to get the 'x' best
> matches. I say so because I can't recall anything that uses the score as an
> absolute metric to separate 'good' from 'not good' matches.
> --
> Anshum Gupta
> Naukri Labs!
> The facts expressed here belong to everybody, the opinions to me. The
> distinction is yours to draw............
> On Thu, May 7, 2009 at 6:27 AM, Nate <> wrote:
>> Hi all,
>> First, the problem I'm trying to solve: I have two folders, each
>> containing files. I need to match files in one folder with files in
>> the other. Eg:
>> notes/Michael Jackson - Don't Stop 'till You Get Enough.notes
>> songs/Michael Jackson Don't stop until you get enough.mp3
>> I provide the notes files, but the song files come from a user's music
>> library, so often are not named well. I am attempting to use Lucene to
>> find the most likely note file for each song file.
>> I index the note files, then I use the StandardAnalyzer with carefully
>> chosen stop words to search the index. The query uses each word in the
>> song file name (w/o extension) as a term. Fuzzy matching is used for
>> words longer than 4 characters, and the fuzzy percentage is set to
>> 1 / termlength. This works ok so far, though I would love to hear opinions
>> on any improvements I could make. This is my first use of Lucene, so
>> I'm not sure I've chosen the best approach.
>> The problem I'm having is: Sometimes there is a song file that has no
>> matching note file. In this case I get back results with "low" scores,
>> such as 0.2 or 0.05. A "really good" match gives me 7 or 8. I don't
>> really understand what the scoring means, so I don't know what would
>> be a reasonable threshold to ignore scores.
>> I understand scores are not relevance percentages. I think the scores
>> are only useful relative to other scores. Is this right? Are they only
>> relative to scores from the same search, or from any search against
>> the same index? How can I know if a score is "low", so I can ignore
>> matches that aren't very good?
>> Sorry if this has been discussed before. I have searched around a
>> great deal and was unable to find a straight answer.
>> Thanks!
>> -Nate
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:
