To: java-user@lucene.apache.org
Subject: Re: How to implement cut of score ?
From: Donna L Gresh
Date: Mon, 13 Aug 2007 14:40:37 -0400

Hoss wrote:

this would be meaningless even if it were easier...
http://wiki.apache.org/lucene-java/LuceneFAQ#head-912c1f237bb00259185353182948e5935f0c2f03
FAQ: "Can I filter by score?"
-Hoss

I've read the warnings referenced there; but still have a problem to solve. We have "fact-based" information about people and the jobs they might fill (availability dates, experience level, languages spoken, etc.) and we have textual information about both the jobs and the people (e.g. resumes). We'd like to use the "goodness of match" of the textual description of the job to the person's resume as a way to suggest, for example, additional people who should be considered for the job, even if, say, their specific job title does not match the requested job title.

I can use the job description to construct a query (and I've done it in a variety of ways), but how best to choose which of the returned people to allow to "fit" the job? An obviously desirable way to do it is using the score, but all discussion seems to say "don't do that, since absolute score isn't meaningful" (and I don't use the normalized score, BTW, I use the raw score, but the same caveats apply). Certainly the relative scores for a single query can be used to rank the goodness of fit to that particular job, but that doesn't solve the problem in general.

Should I just give up and return the "25 best" fits to the job, and only use score to rank them relative to one another? This then means that a job description that has very few words that match *anything* in the collection of resumes will still produce 25 people that "match".

In practice (anecdotally) it does appear to me that when the highest score for a particular job description is fairly small (say 0.10) that there's a good reason for that, and when the highest score is something like 0.60, there's a good reason for that as well. That is, queries that yield a small "best score" are queries for which I would not expect good matches, and vice versa. So it does seem (again anecdotally) that the score has *some* relevance.

What are the experts' thoughts on this?

Donna L. Gresh
Services Research, Mathematical Sciences Department
IBM T.J. Watson Research Center
(914) 945-2472
http://www.research.ibm.com/people/g/donnagresh
gresh@us.ibm.com
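
For concreteness, the two options weighed above might look roughly like the sketch below. It is only an illustration: `Hit`, `topN`, and `topNIfBestScoreAtLeast` are hypothetical names, not Lucene API, and the hits are assumed to have already been copied out of whatever the search returned. The 0.30 gate used in `main` is an arbitrary value picked to sit between the 0.10 and 0.60 examples; given the FAQ's warnings, any such number would have to be tuned per collection and per query style, if it is usable at all.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: none of these names come from Lucene.
class ScoreCutoffSketch {

    // Stand-in for one result: a person id plus the raw score for one query.
    static final class Hit {
        final String personId;
        final float score;

        Hit(String personId, float score) {
            this.personId = personId;
            this.score = score;
        }
    }

    // Option 1: return up to maxHits hits ranked by score, however low the scores are.
    static List<Hit> topN(List<Hit> hits, int maxHits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(maxHits, sorted.size())));
    }

    // Option 2: same ranking, but return nothing when even the best hit scores
    // below minBestScore (an assumed, collection-specific threshold).
    static List<Hit> topNIfBestScoreAtLeast(List<Hit> hits, int maxHits, float minBestScore) {
        List<Hit> ranked = topN(hits, maxHits);
        if (ranked.isEmpty() || ranked.get(0).score < minBestScore) {
            return new ArrayList<>();
        }
        return ranked;
    }

    public static void main(String[] args) {
        // Made-up scores echoing the "small best score" and "large best score" cases.
        List<Hit> weakQuery = Arrays.asList(new Hit("p1", 0.10f), new Hit("p2", 0.04f));
        List<Hit> strongQuery = Arrays.asList(new Hit("p3", 0.35f), new Hit("p4", 0.60f));

        System.out.println(topN(weakQuery, 25).size());
        System.out.println(topNIfBestScoreAtLeast(weakQuery, 25, 0.30f).size());
        System.out.println(topNIfBestScoreAtLeast(strongQuery, 25, 0.30f).size());
    }
}
```

The second method only uses the score to decide whether a query produced anything worth showing; within a query, the hits are still just ranked relative to one another.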