lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Nate <>
Subject Re: interpreting scores
Date Sat, 09 May 2009 03:13:45 GMT
Wow Karl, thank you so much for writing this up! It was a great help!
I have the ngram tokenizing working as you described. Searches are
very good!

In order to verify the hits are of high quality, I use the
Smith-Waterman algorithm. Other approximate string comparisons I
evaluated didn't work well because the scores change with the string
length. Eg...

Song file: red hot chili pepper under bridge
Lucene match: red hot chili peppers under bridge
Song name of hit for quality verification: under bridge

The song file we searched with is compared with the song name of the
hit. Most algorithms give something like 0.6. I don't know which parts
of the song file are the artist, if any, but if it is there then
artists with longer names score lower with these algorithms. However,
Smith-Waterman gives 1.0 (perfect match). If bridge were misspelled,
it would still get a high score.

In most cases, if there is no good match, I can detect it and show no
results. There is one problem though, take this example where there is
no "good" match:

Song file: u2 beautiful day
Lucene match: christina aguilera beautiful
Song name of hit for quality verification: beautiful

When I plug "u2 beautiful day" and "beautiful" into Smith-Waterman, I
get 1.0 and so I allow this incorrect hit to be first place. I don't
think there is really a way around this problem though. I guess maybe
the artist name could be required to be somewhere in the file path.

Overall I am pretty happy with my matching! I still need to do more
testing on some friends' dirty (the organization, not the content ;))
music libraries, but from my initial testing I think it works very
nicely. Thanks again! Any further advice or comments is very much
welcomed. :)


On Fri, May 8, 2009 at 10:06 AM, Karl Wettin <> wrote:
> Ngrams can be use for lots of stuff. In your case it has nothing to do with
> spellchecking, it was the "until" vs. "'till" that made me think of them as
> they would allow you to get at least partial matching of the text. Also,
> ngrams gives you a bit of phrase functionallity.
> Create the grams by passing a token containing the complete text to
> NgramTokenFilter. Something like this:
> TokenStream ts = new SingleTokenTokenStream(new Token("Michael Jackson Don't
> Stop 'till You Get Enough"));
> ts = new LowerCaseFilter(ts);
> ts = new NgramTokenFilter(ts, 4, 4);
> document.add(new Field("ngrams", ts));
> At query time you create a query the same way, perhaps something like this:
> TokenStream ts = new SingleTokenTokenStream(new Token("Michael Jackson Don't
> Stop 'till You Get Enough"));
> ts = new LowerCaseFilter(ts);
> ts = new NgramTokenFilter(ts, 4, 4);
> BooleanQuery bq = new BooleanQuery();
> Token token;
> while ((token = Token()) != null) {
>  bq.add(new BooleanClause(new TermQuery("ngrams", token.text()),
> BooleanClause.SHOULD);
> }
> You might want to fiddle around with the gram sizes.
> In order to detect a match you might want to take a look at the distance
> between scores.
> 0.98 : Michael Jackson, Don't stop 'till you get enough
> 0.96 : Michael Jackson, Don't stop until you get enough
> 0.07 : Barry White, Can't get enough of your love babe
> 0.06 : Depeche Mode, Just can't get enough
> Manual inspection of a bunch of queries that match the correct document vs.
> queries that looks for something that does not exist in your index will give
> you an indication of what distance from the top score is intersting. It is
> very probable that if you search for something that does not exist then the
> top hits will all have very similar score, and if you search for something
> that does exist then there will be a rather large gap between the correct
> hit and the first non correct hit.
> In order to fortify the response with knowledge that it really was a good
> hit in the top you might want to use some edit distance measure such as
> Levenstein or perhaps even something like the Jaccard index.
> I've actually done something similar to this. It worked almost flawless but
> as my data required 100% certainty someone had to manually check the data to
> avoid errors. I think that if you can settle with something like 95%
> certainty then it should be possible to have this automated.
>   karl
> 8 maj 2009 kl. 06.57 skrev Nate:
>> Hi Karl,
>> No, sometimes there will not be a matching MP3 for a note file. When
>> this happens, the results I get are very poor. For example, if a song
>> with a common song word like "love" in the name does not have a
>> matching note file, then I get a handful of results that contain the
>> word "love" but are otherwise obviously not a good match. I need some
>> way to judge the quality of the matches, or possible some other
>> approach to doing the search that helps avoid false positives.
>> On your clue, I have been reading about ngrams. Very interesting! I
>> see it is very useful for spell checking. However, how would I
>> leverage ngrams for my needs? Would the Lucene SpellChecker classes be
>> of any use?
>> I really feel like I'm floundering here. I am more than willing to put
>> in the work, I just need a push or two in the right directions. :)
>> Thanks!
>> -Nate
>> On Thu, May 7, 2009 at 7:50 AM, Karl Wettin <> wrote:
>>> Nate,
>>> will there always be a correspodning mp3 for any given note sheet?
>>> As for analysis, I'd try using ngrams of the complete untokenized file
>>> name
>>> if I was you.
>>> "Michael Jackson Don't Stop 'till You Get Enough" ->
>>> "^mic", "mich", "icha", "chae", "hael", "ael ", "el j", "l ja", and so
>>> on.
>>> See
>>>   karl
>>> 7 maj 2009 kl. 08.28 skrev Nate:
>>>> Thanks Anshum.
>>>> What happens if a search returns only one match, and that match is not
>>>> very "good"? If scores are only comparable to the scores of other
>>>> matches in the same search, then the score is effectively meaningless
>>>> if there is only one match.
>>>> It seems like a very common need to want to provide a "relevance"
>>>> metric along with search results. I somewhat understand the
>>>> complexities after reading this thread and the threads it links...
>>>> My case is slightly better since I don't care to show users the
>>>> metric. My queries are simple term and boolean queries.
>>>> This thread talks about "theoretical maximum score" but quickly loses
>>>> me. Does this seem like the road to go down, given my needs?
>>>> Say I do a search like:
>>>> Michael Jackson Don't stop until you get enough
>>>> And this is the top match:
>>>> Michael Jackson Don't Stop 'till You Get Enough
>>>> Would it make any sense to do a query with the exact contents of the
>>>> top match to get a maximum score for that document? Would the
>>>> resulting percentage be meaningful?
>>>> -Nate
>>>> On Wed, May 6, 2009 at 10:08 PM, Anshum <> wrote:
>>>>> Hi Nate,
>>>>> The scores are only comparable within the same search and not over
>>>>> different
>>>>> searches as the scores are affected by query as well as docs.
>>>>> About the threshold, I guess you could have count cutoff to get 'x'
>>>>> best
>>>>> matches. Said so coz I'm not really able to recollect anything which
>>>>> could
>>>>> use score as a metric to absolutely cluster 'good' and 'not good'
>>>>> matches.
>>>>> --
>>>>> Anshum Gupta
>>>>> Naukri Labs!
>>>>> The facts expressed here belong to everybody, the opinions to me. The
>>>>> distinction is yours to draw............
>>>>> On Thu, May 7, 2009 at 6:27 AM, Nate <> wrote:
>>>>>> Hi all,
>>>>>> First, the problem I'm trying to solve: I have two folders, each
>>>>>> containing files. I need to match files in one folder with files
>>>>>> the other. Eg:
>>>>>> notes/Michael Jackson - Don't Stop 'till You Get Enough.notes
>>>>>> songs/Michael Jackson Don't stop until you get enough.mp3
>>>>>> I provide the notes files, but the song files come from a user's
>>>>>> library, so often are not named well. I am attempting to use Lucene
>>>>>> find the most likely note file for each song file.
>>>>>> I index the note files, then I use the StandardAnalyzer with carefully
>>>>>> chosen stop words to search the index. The query uses each word in
>>>>>> song file name (w/o extension) as a term. Fuzzy matching is used
>>>>>> words with > 4 characters, and the fuzzy percentage is set to
be 1 /
>>>>>> termlength. This works ok so far, though I would love to hear opinions
>>>>>> on any improvements I could make. This is my first use of Lucene,
>>>>>> I'm not sure I've chosen the best approach.
>>>>>> The problem I'm having is: Sometimes there is a song file that has
>>>>>> matching note file. In this case I get back results with "low" scores,
>>>>>> such as 0.2 or 0.05. A "really good" match gives me 7 or 8. I don't
>>>>>> really understand what the scoring means, so I don't know what would
>>>>>> be a reasonable threshold to ignore scores.
>>>>>> I understand scores are not relevance percentages. I think the scores
>>>>>> are only useful relative to other scores. Is this right? Are they
>>>>>> relative to scores from the same search, or from any search against
>>>>>> the same index? How can I know if a score is "low", so I can ignore
>>>>>> matches that aren't very good?
>>>>>> Sorry if this has been discussed before. I have searched around a
>>>>>> great deal and was unable to find a straight answer.
>>>>>> Thanks!
>>>>>> -Nate
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail:
>>>>>> For additional commands, e-mail:
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail:
>>>> For additional commands, e-mail:
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail:
>>> For additional commands, e-mail:
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message