Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: pass (athena.apache.org: domain of karl.wettin@gmail.com
 designates 209.85.134.185 as permitted sender)
DomainKey-Signature: a=rsa-sha1; c=nofws;
        d=gmail.com; s=gamma;
        h=message-id:from:to:in-reply-to:content-type
         :content-transfer-encoding:mime-version:subject:date:references
         :x-mailer;
        b=axkrqOJiF6vcMKnXwCswcYg+bMhoCDdmyjbpvdFFRQhyk5V5Cl0qqXTgKijXPRJZYH
         lHEDeZnAOGGUyWnSl/q6mDdWOgAhhGmBUKwI1scByo9ZgdENaxn1uv/kQoLvsWWdy+wv
         LAP2EhVMJ4z9mz8ivIsF8JYExTgTen3jzKyZk=
Message-Id: <B25ADB11-E0D7-4AB9-96C5-465DC21C66B4@gmail.com>
From: Karl Wettin <karl.wettin@gmail.com>
To: java-user@lucene.apache.org
In-Reply-To: <7d6ffee60905062328o39d83609s3a761949b1d26e9f@mail.gmail.com>
Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
Content-Transfer-Encoding: 7bit
Mime-Version: 1.0 (Apple Message framework v930.3)
Subject: Re: interpreting scores
Date: Thu, 7 May 2009 16:50:13 +0200
References: <7d6ffee60905061757x258dbf58ic8c894334c2da416@mail.gmail.com>
  <867513fe0905062208m15e70ff3y1edb1cf775474a8b@mail.gmail.com>
 <7d6ffee60905062328o39d83609s3a761949b1d26e9f@mail.gmail.com>

Nate,

will there always be a correspodning mp3 for any given note sheet?


As for analysis, I'd try using ngrams of the complete untokenized file  
name if I was you.

"Michael Jackson Don't Stop 'till You Get Enough" ->
"^mic", "mich", "icha", "chae", "hael", "ael ", "el j", "l ja", and so  
on.

See http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/analysis/ngram/package-summary.html


     karl

7 maj 2009 kl. 08.28 skrev Nate:

> Thanks Anshum.
>
> What happens if a search returns only one match, and that match is not
> very "good"? If scores are only comparable to the scores of other
> matches in the same search, then the score is effectively meaningless
> if there is only one match.
>
> It seems like a very common need to want to provide a "relevance"
> metric along with search results. I somewhat understand the
> complexities after reading this thread and the threads it links...
> http://www.gossamer-threads.com/lists/lucene/java-user/75002
> My case is slightly better since I don't care to show users the
> metric. My queries are simple term and boolean queries.
>
> This thread talks about "theoretical maximum score" but quickly loses
> me. Does this seem like the road to go down, given my needs?
> http://www.gossamer-threads.com/lists/lucene/java-user/61075#61075
>
> Say I do a search like:
> Michael Jackson Don't stop until you get enough
> And this is the top match:
> Michael Jackson Don't Stop 'till You Get Enough
> Would it make any sense to do a query with the exact contents of the
> top match to get a maximum score for that document? Would the
> resulting percentage be meaningful?
>
> -Nate
>
>
> On Wed, May 6, 2009 at 10:08 PM, Anshum <anshumg@gmail.com> wrote:
>> Hi Nate,
>> The scores are only comparable within the same search and not over  
>> different
>> searches as the scores are affected by query as well as docs.
>> About the threshold, I guess you could have count cutoff to get 'x'  
>> best
>> matches. Said so coz I'm not really able to recollect anything  
>> which could
>> use score as a metric to absolutely cluster 'good' and 'not good'  
>> matches.
>>
>> --
>> Anshum Gupta
>> Naukri Labs!
>> http://ai-cafe.blogspot.com
>>
>> The facts expressed here belong to everybody, the opinions to me. The
>> distinction is yours to draw............
>>
>>
>> On Thu, May 7, 2009 at 6:27 AM, Nate <nate@n4te.com> wrote:
>>
>>> Hi all,
>>>
>>> First, the problem I'm trying to solve: I have two folders, each
>>> containing files. I need to match files in one folder with files in
>>> the other. Eg:
>>>
>>> notes/Michael Jackson - Don't Stop 'till You Get Enough.notes
>>> songs/Michael Jackson Don't stop until you get enough.mp3
>>>
>>> I provide the notes files, but the song files come from a user's  
>>> music
>>> library, so often are not named well. I am attempting to use  
>>> Lucene to
>>> find the most likely note file for each song file.
>>>
>>> I index the note files, then I use the StandardAnalyzer with  
>>> carefully
>>> chosen stop words to search the index. The query uses each word in  
>>> the
>>> song file name (w/o extension) as a term. Fuzzy matching is used for
>>> words with > 4 characters, and the fuzzy percentage is set to be 1 /
>>> termlength. This works ok so far, though I would love to hear  
>>> opinions
>>> on any improvements I could make. This is my first use of Lucene, so
>>> I'm not sure I've chosen the best approach.
>>>
>>> The problem I'm having is: Sometimes there is a song file that has  
>>> no
>>> matching note file. In this case I get back results with "low"  
>>> scores,
>>> such as 0.2 or 0.05. A "really good" match gives me 7 or 8. I don't
>>> really understand what the scoring means, so I don't know what would
>>> be a reasonable threshold to ignore scores.
>>>
>>> I understand scores are not relevance percentages. I think the  
>>> scores
>>> are only useful relative to other scores. Is this right? Are they  
>>> only
>>> relative to scores from the same search, or from any search against
>>> the same index? How can I know if a score is "low", so I can ignore
>>> matches that aren't very good?
>>>
>>> Sorry if this has been discussed before. I have searched around a
>>> great deal and was unable to find a straight answer.
>>>
>>> Thanks!
>>> -Nate
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org