From: Karl Wettin
To: java-user@lucene.apache.org
Subject: Re: interpreting scores
Date: Thu, 7 May 2009 16:50:13 +0200

Nate,

will there always be a corresponding mp3 for any given note sheet?

As for analysis, I'd try using ngrams of the complete untokenized file
name if I were you.

"Michael Jackson Don't Stop 'till You Get Enough" ->
"^mic", "mich", "icha", "chae", "hael", "ael ", "el j", "l ja", and so on.

See http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/analysis/ngram/package-summary.html


      karl

On 7 May 2009 at 08:28, Nate wrote:

> Thanks Anshum.
>
> What happens if a search returns only one match, and that match is not
> very "good"? If scores are only comparable to the scores of other
> matches in the same search, then the score is effectively meaningless
> if there is only one match.
>
> It seems like a very common need to want to provide a "relevance"
> metric along with search results. I somewhat understand the
> complexities after reading this thread and the threads it links...
> http://www.gossamer-threads.com/lists/lucene/java-user/75002
> My case is slightly better since I don't care to show users the
> metric. My queries are simple term and boolean queries.
>
> This thread talks about "theoretical maximum score" but quickly loses
> me. Does this seem like the road to go down, given my needs?
> http://www.gossamer-threads.com/lists/lucene/java-user/61075#61075
>
> Say I do a search like:
> Michael Jackson Don't stop until you get enough
> And this is the top match:
> Michael Jackson Don't Stop 'till You Get Enough
> Would it make any sense to do a query with the exact contents of the
> top match to get a maximum score for that document? Would the
> resulting percentage be meaningful?
>
> -Nate
>
>
> On Wed, May 6, 2009 at 10:08 PM, Anshum wrote:
>> Hi Nate,
>> The scores are only comparable within the same search and not over
>> different searches as the scores are affected by query as well as
>> docs.
>> About the threshold, I guess you could have count cutoff to get 'x'
>> best matches. Said so coz I'm not really able to recollect anything
>> which could use score as a metric to absolutely cluster 'good' and
>> 'not good' matches.
>>
>> --
>> Anshum Gupta
>> Naukri Labs!
>> http://ai-cafe.blogspot.com
>>
>> The facts expressed here belong to everybody, the opinions to me. The
>> distinction is yours to draw............
>>
>>
>> On Thu, May 7, 2009 at 6:27 AM, Nate wrote:
>>
>>> Hi all,
>>>
>>> First, the problem I'm trying to solve: I have two folders, each
>>> containing files. I need to match files in one folder with files in
>>> the other.
>>> Eg:
>>>
>>> notes/Michael Jackson - Don't Stop 'till You Get Enough.notes
>>> songs/Michael Jackson Don't stop until you get enough.mp3
>>>
>>> I provide the notes files, but the song files come from a user's
>>> music library, so often are not named well. I am attempting to use
>>> Lucene to find the most likely note file for each song file.
>>>
>>> I index the note files, then I use the StandardAnalyzer with
>>> carefully chosen stop words to search the index. The query uses each
>>> word in the song file name (w/o extension) as a term. Fuzzy matching
>>> is used for words with > 4 characters, and the fuzzy percentage is
>>> set to be 1 / termlength. This works ok so far, though I would love
>>> to hear opinions on any improvements I could make. This is my first
>>> use of Lucene, so I'm not sure I've chosen the best approach.
>>>
>>> The problem I'm having is: Sometimes there is a song file that has
>>> no matching note file. In this case I get back results with "low"
>>> scores, such as 0.2 or 0.05. A "really good" match gives me 7 or 8.
>>> I don't really understand what the scoring means, so I don't know
>>> what would be a reasonable threshold to ignore scores.
>>>
>>> I understand scores are not relevance percentages. I think the
>>> scores are only useful relative to other scores. Is this right? Are
>>> they only relative to scores from the same search, or from any
>>> search against the same index? How can I know if a score is "low",
>>> so I can ignore matches that aren't very good?
>>>
>>> Sorry if this has been discussed before. I have searched around a
>>> great deal and was unable to find a straight answer.
>>>
>>> Thanks!
>>> -Nate

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
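
A minimal sketch of the character n-gram idea from Karl's reply, in plain
Java with no Lucene dependency. The class and method names (FileNameNGrams,
ngrams) are hypothetical, and lowercasing plus marking only the start of the
name with "^" are assumptions read off the example n-grams in the reply. It
only shows what the grams of an untokenized file name look like; it is not
the Lucene ngram package's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class FileNameNGrams {

    // Hypothetical helper, not part of Lucene: returns every run of n
    // consecutive characters from the whole lowercased name, with "^"
    // prepended as a start-of-name marker (an assumption, see above).
    static List<String> ngrams(String name, int n) {
        String s = "^" + name.toLowerCase(Locale.ROOT);
        List<String> grams = new ArrayList<>();
        // Slide a window of width n over the string, one character at a time.
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Prints the 4-grams of the file name used in the thread.
        System.out.println(
            ngrams("Michael Jackson Don't Stop 'till You Get Enough", 4));
    }
}
```

Because the name is not split into words first, grams such as "ael " and
"el j" span word boundaries, so word order and spacing contribute to the
match as well as the words themselves.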