Date: Thu, 7 May 2009 10:38:37 +0530
Subject: Re: interpreting scores
From: Anshum
To: java-user@lucene.apache.org

Hi Nate,

The scores are only comparable within the same search, not across different searches, because they are affected by the query as well as by the docs.

As for the threshold, I guess you could use a count cutoff to get the 'x' best matches. I say so because I can't recall anything that uses the score as a metric to absolutely classify matches as 'good' or 'not good'.

--
Anshum Gupta
Naukri Labs!
http://ai-cafe.blogspot.com

The facts expressed here belong to everybody, the opinions to me. The distinction is yours to draw............

On Thu, May 7, 2009 at 6:27 AM, Nate wrote:
> Hi all,
>
> First, the problem I'm trying to solve: I have two folders, each
> containing files. I need to match files in one folder with files in
> the other. Eg:
>
> notes/Michael Jackson - Don't Stop 'till You Get Enough.notes
> songs/Michael Jackson Don't stop until you get enough.mp3
>
> I provide the notes files, but the song files come from a user's music
> library, so often are not named well. I am attempting to use Lucene to
> find the most likely note file for each song file.
>
> I index the note files, then I use the StandardAnalyzer with carefully
> chosen stop words to search the index. The query uses each word in the
> song file name (w/o extension) as a term. Fuzzy matching is used for
> words with > 4 characters, and the fuzzy percentage is set to be 1 /
> termlength. This works ok so far, though I would love to hear opinions
> on any improvements I could make. This is my first use of Lucene, so
> I'm not sure I've chosen the best approach.
>
> The problem I'm having is: Sometimes there is a song file that has no
> matching note file. In this case I get back results with "low" scores,
> such as 0.2 or 0.05. A "really good" match gives me 7 or 8. I don't
> really understand what the scoring means, so I don't know what would
> be a reasonable threshold to ignore scores.
>
> I understand scores are not relevance percentages. I think the scores
> are only useful relative to other scores. Is this right? Are they only
> relative to scores from the same search, or from any search against
> the same index? How can I know if a score is "low", so I can ignore
> matches that aren't very good?
>
> Sorry if this has been discussed before. I have searched around a
> great deal and was unable to find a straight answer.
>
> Thanks!
> -Nate
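
The count cutoff suggested in the reply could be sketched roughly as follows. This is only an illustration in plain Java, not Lucene code: the `Hit` class, the `topX` method, and the file names and scores are hypothetical stand-ins for whatever a single search returns. It ranks hits from one search against each other and keeps the best x, which is consistent with scores being comparable only within the same search.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class TopMatches {
    /** Hypothetical stand-in for one search result: a file name and its raw score. */
    static final class Hit {
        final String name;
        final float score;

        Hit(String name, float score) {
            this.name = name;
            this.score = score;
        }
    }

    /**
     * Returns the x highest-scoring hits, best first.
     * All hits must come from the SAME search, since scores from
     * different searches are not comparable with each other.
     */
    static List<Hit> topX(List<Hit> hitsFromOneSearch, int x) {
        if (x < 0) {
            throw new IllegalArgumentException("x must be >= 0");
        }
        // Copy so the caller's list is left untouched, then sort by descending score.
        List<Hit> sorted = new ArrayList<>(hitsFromOneSearch);
        sorted.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        // Keep at most x entries; fewer if the search returned fewer hits.
        return new ArrayList<>(sorted.subList(0, Math.min(x, sorted.size())));
    }

    public static void main(String[] args) {
        // Made-up scores for illustration only.
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("a.notes", 0.2f));
        hits.add(new Hit("b.notes", 7.5f));
        hits.add(new Hit("c.notes", 0.05f));

        for (Hit h : topX(hits, 1)) {
            System.out.println(h.name + " " + h.score);
        }
    }
}
```

Note that a count cutoff always returns the best of whatever was found, so on its own it does not tell you that a song file has no matching note file at all.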