From: "Daniel Einspanjer"
Date: Sat, 5 May 2007 10:18:44 -0400
To: java-user@lucene.apache.org, solr-user@lucene.apache.org
Subject: Re: Ideas for a relevance score that could be considered stable across multiple searches with the same query structure?

On 4/11/07, Chris Hostetter wrote:
>
> A custom Similarity class with simplified tf, idf, and queryNorm functions
> might also help you get scores from the Explain method that are more
> easily manageable, since you'll have predictable query structures hard-coded
> into your application.
>
> i.e.: run the large query once, get the results back, and for each result
> look at the explanation, pull out the individual pieces of the explanation,
> and compare them with those of the other matches to create your own
> "normalization".

Chuck Williams mentioned a proposal he had for normalization of scores that would give a constant score range and so allow comparison of scores. Chuck, did you ever write any code to that end, or was it just algorithmic discussion?

Here is the point I'm at now: I have my matching engine working. The fields to be indexed and the queries are defined by the user.
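Hoss's "create your own normalization" suggestion can be sketched without any Lucene plumbing. Assuming you have already walked each hit's Explanation and collected a raw score per query clause (the clause names and score values below are invented for illustration), one simple normalization is to divide each clause's score by the maximum seen for that clause across the result set, so every clause contributes a comparable value in [0, 1]:

```java
import java.util.*;

public class ExplainNormalizer {
    /** Divide each clause's raw score by the per-clause maximum across all hits,
     *  yielding comparable values in [0, 1] regardless of the clause's raw scale. */
    static List<Map<String, Double>> normalize(List<Map<String, Double>> hits) {
        // First pass: find the maximum raw score for each clause.
        Map<String, Double> max = new HashMap<>();
        for (Map<String, Double> hit : hits)
            for (Map.Entry<String, Double> e : hit.entrySet())
                max.merge(e.getKey(), e.getValue(), Math::max);
        // Second pass: scale each hit's clause scores by that maximum.
        List<Map<String, Double>> out = new ArrayList<>();
        for (Map<String, Double> hit : hits) {
            Map<String, Double> norm = new HashMap<>();
            for (Map.Entry<String, Double> e : hit.entrySet())
                norm.put(e.getKey(), e.getValue() / max.get(e.getKey()));
            out.add(norm);
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical per-clause raw scores pulled from two hits' explanations.
        Map<String, Double> hit1 = Map.of("title", 4.0, "year", 1.0);
        Map<String, Double> hit2 = Map.of("title", 2.0, "year", 2.0);
        List<Map<String, Double>> norm = normalize(List.of(hit1, hit2));
        System.out.println(norm.get(0).get("title")); // 1.0
        System.out.println(norm.get(1).get("title")); // 0.5
    }
}
```

This is a per-result-set normalization, so the resulting numbers are still only comparable within one search, not across runs — which is exactly the gap a constant-score-range scheme like Chuck's proposal would close.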
Hoss, I'm not sure how that affects your idea of having a custom Similarity class, since you mentioned that having predictable query structures was important... The user kicks off an indexing run and then defines the queries they want to try matching with.

Here is an example of the query fragments I'm working with right now:

  year_str:"${Year}"^2
  year_str:[${Year -1} TO ${Year +1}]
  title_title_mv:"${Title}"^10
  title_title_mv:${Title}^2
  +(title_title_mv:"${Title}"~^5 title_title_mv:${Title}~)
  director_name_mv:"${Director}"~2^10
  director_name_mv:${Director}^5
  director_name_mv:${Director}~.7

For each item in the source feed, the variables are interpolated (a query term is transformed into a grouped term if there are multiple values for a variable). That query is then run to find the overall best match.

I then determine the relevance of each query fragment. I haven't written any plugins for Lucene yet, so my current method of determining relevance is to run each query fragment by itself and iterate through the results, looking to see whether the overall best match is in that result set. If it is, I record the rank and multiply that rank (e.g. 5 out of 10) by a configured fragment weight.

Since the scores aren't normalized, I have no good way of telling a poor overall match from a really high-quality one: even a poor overall match could be the first item returned by every query fragment.

Any help here would be much appreciated. Ideally, I'm hoping that Chuck has a patch or plugin I could use to normalize my scores, so that I could let the user do a matching run, look at the results, and decide what score threshold to set for subsequent runs.

Thanks,
Daniel
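The variable-interpolation step Daniel describes can be sketched as plain string substitution. This is only an illustration, not his actual code: it handles the simple `${Name}` form, while the arithmetic form (`${Year -1}`) and the multi-value-to-grouped-term transformation mentioned in the post are left out:

```java
import java.util.*;
import java.util.regex.*;

public class QueryInterpolator {
    // Matches simple ${Name} placeholders; arithmetic forms like ${Year -1}
    // deliberately do not match and are passed through untouched.
    private static final Pattern VAR = Pattern.compile("\\$\\{(\\w+)\\}");

    /** Replace ${Name} placeholders in a query fragment with values taken
     *  from one source-feed item (missing variables become empty strings). */
    static String interpolate(String fragment, Map<String, String> item) {
        Matcher m = VAR.matcher(fragment);
        StringBuilder sb = new StringBuilder();
        while (m.find())
            m.appendReplacement(sb, Matcher.quoteReplacement(item.getOrDefault(m.group(1), "")));
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical feed item; field names follow the fragments above.
        Map<String, String> item = Map.of("Title", "Blade Runner", "Year", "1982");
        System.out.println(interpolate("title_title_mv:\"${Title}\"^10", item));
        // title_title_mv:"Blade Runner"^10
    }
}
```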
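The rank-times-weight step described above can be sketched as follows. The post doesn't spell out the exact formula, so this assumes a better (lower) rank should contribute more, converting a 1-based rank out of n results into (n - rank + 1) / n before multiplying by the configured fragment weight; the weights and ranks in main are invented for illustration:

```java
public class FragmentScorer {
    /** Contribution of one query fragment: if the overall best match appears at
     *  1-based position `rank` out of `n` results, return
     *  weight * (n - rank + 1) / n; if it is absent (rank <= 0), return 0.
     *  The exact formula is an assumption -- the original post only says
     *  "multiply that rank by a configured fragment weight". */
    static double fragmentScore(int rank, int n, double weight) {
        if (rank <= 0 || n <= 0) return 0.0;
        return weight * (n - rank + 1) / (double) n;
    }

    public static void main(String[] args) {
        // Hypothetical fragments: (rank of best match, result count, weight).
        double total = fragmentScore(1, 10, 10.0)  // title phrase: best match on top
                     + fragmentScore(5, 10, 5.0)   // director: best match mid-list
                     + fragmentScore(-1, 8, 2.0);  // year range: best match absent
        System.out.println(total); // 13.0
    }
}
```

Note this shares the weakness Daniel points out: ranks are relative to each fragment's own result set, so without score normalization a uniformly weak item can still rank first everywhere and earn the maximum total.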