From: Nate
Date: Thu, 7 May 2009 21:57:30 -0700
Message-ID: <7d6ffee60905072157h2e45fa81q8ff9a2f86aaca165@mail.gmail.com>
Subject: Re: interpreting scores
To: java-user@lucene.apache.org

Hi Karl,

No,
sometimes there will not be a matching MP3 for a note file. When this
happens, the results I get are very poor. For example, if a song with a
common song word like "love" in the name does not have a matching note
file, then I get a handful of results that contain the word "love" but
are otherwise obviously not a good match. I need some way to judge the
quality of the matches, or possibly some other approach to doing the
search that helps avoid false positives.

On your clue, I have been reading about ngrams. Very interesting! I see
they are very useful for spell checking. However, how would I leverage
ngrams for my needs? Would the Lucene SpellChecker classes be of any
use?

I really feel like I'm floundering here. I am more than willing to put
in the work, I just need a push or two in the right direction. :)

Thanks!
-Nate


On Thu, May 7, 2009 at 7:50 AM, Karl Wettin wrote:
> Nate,
>
> will there always be a corresponding mp3 for any given note sheet?
>
>
> As for analysis, I'd try using ngrams of the complete untokenized file name
> if I was you.
>
> "Michael Jackson Don't Stop 'till You Get Enough" ->
> "^mic", "mich", "icha", "chae", "hael", "ael ", "el j", "l ja", and so on.
>
> See
> http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/analysis/ngram/package-summary.html
>
>
>    karl
>
> On 7 May 2009 at 08:28, Nate wrote:
>
>> Thanks Anshum.
>>
>> What happens if a search returns only one match, and that match is not
>> very "good"? If scores are only comparable to the scores of other
>> matches in the same search, then the score is effectively meaningless
>> if there is only one match.
>>
>> It seems like a very common need to want to provide a "relevance"
>> metric along with search results. I somewhat understand the
>> complexities after reading this thread and the threads it links...
>> http://www.gossamer-threads.com/lists/lucene/java-user/75002
>> My case is slightly better since I don't care to show users the
>> metric.
>> My queries are simple term and boolean queries.
>>
>> This thread talks about "theoretical maximum score" but quickly loses
>> me. Does this seem like the road to go down, given my needs?
>> http://www.gossamer-threads.com/lists/lucene/java-user/61075#61075
>>
>> Say I do a search like:
>> Michael Jackson Don't stop until you get enough
>> And this is the top match:
>> Michael Jackson Don't Stop 'till You Get Enough
>> Would it make any sense to do a query with the exact contents of the
>> top match to get a maximum score for that document? Would the
>> resulting percentage be meaningful?
>>
>> -Nate
>>
>>
>> On Wed, May 6, 2009 at 10:08 PM, Anshum wrote:
>>>
>>> Hi Nate,
>>> The scores are only comparable within the same search and not over
>>> different searches as the scores are affected by query as well as docs.
>>> About the threshold, I guess you could have count cutoff to get 'x' best
>>> matches. Said so coz I'm not really able to recollect anything which
>>> could use score as a metric to absolutely cluster 'good' and 'not good'
>>> matches.
>>>
>>> --
>>> Anshum Gupta
>>> Naukri Labs!
>>> http://ai-cafe.blogspot.com
>>>
>>> The facts expressed here belong to everybody, the opinions to me. The
>>> distinction is yours to draw............
>>>
>>>
>>> On Thu, May 7, 2009 at 6:27 AM, Nate wrote:
>>>
>>>> Hi all,
>>>>
>>>> First, the problem I'm trying to solve: I have two folders, each
>>>> containing files. I need to match files in one folder with files in
>>>> the other. Eg:
>>>>
>>>> notes/Michael Jackson - Don't Stop 'till You Get Enough.notes
>>>> songs/Michael Jackson Don't stop until you get enough.mp3
>>>>
>>>> I provide the notes files, but the song files come from a user's music
>>>> library, so often are not named well. I am attempting to use Lucene to
>>>> find the most likely note file for each song file.
>>>>
>>>> I index the note files, then I use the StandardAnalyzer with carefully
>>>> chosen stop words to search the index.
>>>> The query uses each word in the
>>>> song file name (w/o extension) as a term. Fuzzy matching is used for
>>>> words with > 4 characters, and the fuzzy percentage is set to be 1 /
>>>> termlength. This works ok so far, though I would love to hear opinions
>>>> on any improvements I could make. This is my first use of Lucene, so
>>>> I'm not sure I've chosen the best approach.
>>>>
>>>> The problem I'm having is: Sometimes there is a song file that has no
>>>> matching note file. In this case I get back results with "low" scores,
>>>> such as 0.2 or 0.05. A "really good" match gives me 7 or 8. I don't
>>>> really understand what the scoring means, so I don't know what would
>>>> be a reasonable threshold to ignore scores.
>>>>
>>>> I understand scores are not relevance percentages. I think the scores
>>>> are only useful relative to other scores. Is this right? Are they only
>>>> relative to scores from the same search, or from any search against
>>>> the same index? How can I know if a score is "low", so I can ignore
>>>> matches that aren't very good?
>>>>
>>>> Sorry if this has been discussed before. I have searched around a
>>>> great deal and was unable to find a straight answer.
>>>>
>>>> Thanks!
>>>> -Nate
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
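
A minimal sketch of the n-gram idea from Karl's reply, written in plain Java rather than with Lucene's ngram package. Everything named here is an assumption for illustration and does not come from the thread or from Lucene: the class NgramMatch, its two methods, and the choice of Jaccard overlap as the similarity measure. It lower-cases the whole untokenized file name, prefixes it with "^" as in Karl's example, collects the distinct 4-grams, and reports what fraction of the grams two names have in common.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not a Lucene class): compares two file names by the
// overlap of their character n-grams.
final class NgramMatch {

    // Distinct character n-grams of the lower-cased name, with a leading '^'
    // marking the start of the name as in Karl's example.
    static Set<String> ngrams(String name, int n) {
        String s = "^" + name.toLowerCase(Locale.ROOT);
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    // Jaccard similarity: shared grams divided by all distinct grams.
    // The result is always between 0.0 and 1.0.
    static double similarity(String a, String b, int n) {
        Set<String> shared = new HashSet<>(ngrams(a, n));
        Set<String> all = new HashSet<>(shared);
        Set<String> other = ngrams(b, n);
        shared.retainAll(other);
        all.addAll(other);
        // Names too short to produce any gram cannot be compared.
        return all.isEmpty() ? 0.0 : (double) shared.size() / all.size();
    }

    public static void main(String[] args) {
        String note = "Michael Jackson - Don't Stop 'till You Get Enough";
        String song = "Michael Jackson Don't stop until you get enough";
        String unrelated = "Whitney Houston - I Will Always Love You";
        System.out.printf("note vs song:      %.2f%n", similarity(note, song, 4));
        System.out.printf("note vs unrelated: %.2f%n", similarity(note, unrelated, 4));
    }
}
```

Because this measure is bounded between 0 and 1 regardless of the query, a fixed cutoff could in principle be applied across all song files, which raw Lucene scores do not allow since they are only comparable within one search. Any actual cutoff value would have to be tuned against real file names.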