lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Karl Wettin <karl.wet...@gmail.com>
Subject Re: interpreting scores
Date Fri, 08 May 2009 17:08:42 GMT

8 maj 2009 kl. 13.13 skrev Nate:

> Is it possible to get a count for how many terms a result matched?

Currently I think you can only do that by using Searcher.explain().  
But that is not a very nice solution. A better solution is beeing  
worked on and might be available in a few months or so.


    karl

>
> Googling, it doesn't appear to be done easily. I tried it out by
> breaking my query into words myself, then doing a search for each one
> and keeping track of the results and counts. This way I know if 4 out
> of 5 terms matched a document, it is probably a pretty good match. If
> 1 out of 5 matched then it probably isn't a great match.
>
> 1) Is this approach reasonable?
> 2) What, if anything, do I lose by doing it this way?
> 3) How could I incorporate ngrams?
>
> Thanks!
> -Nate
>
>
> On Thu, May 7, 2009 at 9:57 PM, Nate <nate@n4te.com> wrote:
>> Hi Karl,
>>
>> No, sometimes there will not be a matching MP3 for a note file. When
>> this happens, the results I get are very poor. For example, if a song
>> with a common song word like "love" in the name does not have a
>> matching note file, then I get a handful of results that contain the
>> word "love" but are otherwise obviously not a good match. I need some
>> way to judge the quality of the matches, or possible some other
>> approach to doing the search that helps avoid false positives.
>>
>> On your clue, I have been reading about ngrams. Very interesting! I
>> see it is very useful for spell checking. However, how would I
>> leverage ngrams for my needs? Would the Lucene SpellChecker classes  
>> be
>> of any use?
>>
>> I really feel like I'm floundering here. I am more than willing to  
>> put
>> in the work, I just need a push or two in the right directions. :)
>>
>> Thanks!
>> -Nate
>>
>>
>> On Thu, May 7, 2009 at 7:50 AM, Karl Wettin <karl.wettin@gmail.com>  
>> wrote:
>>> Nate,
>>>
>>> will there always be a correspodning mp3 for any given note sheet?
>>>
>>>
>>> As for analysis, I'd try using ngrams of the complete untokenized  
>>> file name
>>> if I was you.
>>>
>>> "Michael Jackson Don't Stop 'till You Get Enough" ->
>>> "^mic", "mich", "icha", "chae", "hael", "ael ", "el j", "l ja",  
>>> and so on.
>>>
>>> See
>>> http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/analysis/ngram/package-summary.html
>>>
>>>
>>>    karl
>>>
>>> 7 maj 2009 kl. 08.28 skrev Nate:
>>>
>>>> Thanks Anshum.
>>>>
>>>> What happens if a search returns only one match, and that match  
>>>> is not
>>>> very "good"? If scores are only comparable to the scores of other
>>>> matches in the same search, then the score is effectively  
>>>> meaningless
>>>> if there is only one match.
>>>>
>>>> It seems like a very common need to want to provide a "relevance"
>>>> metric along with search results. I somewhat understand the
>>>> complexities after reading this thread and the threads it links...
>>>> http://www.gossamer-threads.com/lists/lucene/java-user/75002
>>>> My case is slightly better since I don't care to show users the
>>>> metric. My queries are simple term and boolean queries.
>>>>
>>>> This thread talks about "theoretical maximum score" but quickly  
>>>> loses
>>>> me. Does this seem like the road to go down, given my needs?
>>>> http://www.gossamer-threads.com/lists/lucene/java-user/61075#61075
>>>>
>>>> Say I do a search like:
>>>> Michael Jackson Don't stop until you get enough
>>>> And this is the top match:
>>>> Michael Jackson Don't Stop 'till You Get Enough
>>>> Would it make any sense to do a query with the exact contents of  
>>>> the
>>>> top match to get a maximum score for that document? Would the
>>>> resulting percentage be meaningful?
>>>>
>>>> -Nate
>>>>
>>>>
>>>> On Wed, May 6, 2009 at 10:08 PM, Anshum <anshumg@gmail.com> wrote:
>>>>>
>>>>> Hi Nate,
>>>>> The scores are only comparable within the same search and not over
>>>>> different
>>>>> searches as the scores are affected by query as well as docs.
>>>>> About the threshold, I guess you could have count cutoff to get  
>>>>> 'x' best
>>>>> matches. Said so coz I'm not really able to recollect anything  
>>>>> which
>>>>> could
>>>>> use score as a metric to absolutely cluster 'good' and 'not good'
>>>>> matches.
>>>>>
>>>>> --
>>>>> Anshum Gupta
>>>>> Naukri Labs!
>>>>> http://ai-cafe.blogspot.com
>>>>>
>>>>> The facts expressed here belong to everybody, the opinions to  
>>>>> me. The
>>>>> distinction is yours to draw............
>>>>>
>>>>>
>>>>> On Thu, May 7, 2009 at 6:27 AM, Nate <nate@n4te.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> First, the problem I'm trying to solve: I have two folders, each
>>>>>> containing files. I need to match files in one folder with  
>>>>>> files in
>>>>>> the other. Eg:
>>>>>>
>>>>>> notes/Michael Jackson - Don't Stop 'till You Get Enough.notes
>>>>>> songs/Michael Jackson Don't stop until you get enough.mp3
>>>>>>
>>>>>> I provide the notes files, but the song files come from a  
>>>>>> user's music
>>>>>> library, so often are not named well. I am attempting to use  
>>>>>> Lucene to
>>>>>> find the most likely note file for each song file.
>>>>>>
>>>>>> I index the note files, then I use the StandardAnalyzer with  
>>>>>> carefully
>>>>>> chosen stop words to search the index. The query uses each word 

>>>>>> in the
>>>>>> song file name (w/o extension) as a term. Fuzzy matching is  
>>>>>> used for
>>>>>> words with > 4 characters, and the fuzzy percentage is set to
 
>>>>>> be 1 /
>>>>>> termlength. This works ok so far, though I would love to hear  
>>>>>> opinions
>>>>>> on any improvements I could make. This is my first use of  
>>>>>> Lucene, so
>>>>>> I'm not sure I've chosen the best approach.
>>>>>>
>>>>>> The problem I'm having is: Sometimes there is a song file that  
>>>>>> has no
>>>>>> matching note file. In this case I get back results with "low"  
>>>>>> scores,
>>>>>> such as 0.2 or 0.05. A "really good" match gives me 7 or 8. I  
>>>>>> don't
>>>>>> really understand what the scoring means, so I don't know what  
>>>>>> would
>>>>>> be a reasonable threshold to ignore scores.
>>>>>>
>>>>>> I understand scores are not relevance percentages. I think the  
>>>>>> scores
>>>>>> are only useful relative to other scores. Is this right? Are  
>>>>>> they only
>>>>>> relative to scores from the same search, or from any search  
>>>>>> against
>>>>>> the same index? How can I know if a score is "low", so I can  
>>>>>> ignore
>>>>>> matches that aren't very good?
>>>>>>
>>>>>> Sorry if this has been discussed before. I have searched around a
>>>>>> great deal and was unable to find a straight answer.
>>>>>>
>>>>>> Thanks!
>>>>>> -Nate
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>>
>>>>>>
>>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message