Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: pass (nike.apache.org: domain of gresh@us.ibm.com designates
 32.97.182.141 as permitted sender)
In-Reply-To: <Pine.LNX.4.58.0708112032070.23693@hal.rescomp.berkeley.edu>
To: java-user@lucene.apache.org
MIME-Version: 1.0
Subject: Re: How to implement cut of score ?
From: Donna L Gresh <gresh@us.ibm.com>
Message-ID: 
 <OF6D027BCD.4FE2A386-ON85257336.00659D6B-85257336.00669759@us.ibm.com>
Date: Mon, 13 Aug 2007 14:40:37 -0400
Content-Type: multipart/alternative;
 boundary="=_alternative 006696D685257336_="

--=_alternative 006696D685257336_=
Content-Type: text/plain; charset="US-ASCII"

Hoss wrote:
this would be meaningless even if it were easier...

http://wiki.apache.org/lucene-java/LuceneFAQ#head-912c1f237bb00259185353182948e5935f0c2f03

FAQ: "Can I filter by score?"


-Hoss

I've read the warnings referenced there; but still have a problem to 
solve. We have "fact-based" information about
people and the jobs they might fill (availability dates, experience level, 
languages spoken, etc.) and we have textual
information about both the jobs and the people (e.g. resumes). We'd like 
to use the "goodness of match" of the
textual description of the job to the person's resume as a way to suggest, 
for example, additional people who
should be considered for the job, even if, say, their specific job title 
does not match the requested job title.

I can use the job description to construct a query (and I've done it in a 
variety of ways), but how best to choose which
of the returned people to allow to "fit" the job? An obviously desirable 
way to do it is using the score, but all discussion
seems to say "don't do that, since absolute score isn't meaningful" (and I 
don't use the normalized score, BTW, I use the
raw score, but the same caveats apply). Certainly the relative scores for 
a single query can be used to rank the goodness
of fit to that particular job, but that doesn't solve the problem in 
general. 

Should I just give up and return the "25 best" fits to the job, and only 
use score to rank them relative to one another? This
then means that a job description that has very few words that match 
*anything* in the collection of resumes will still
produce 25 people that "match". In practice (anecdotally) it does appear 
to me that when the highest score for a 
particular job description is fairly small (say 0.10) that there's a good 
reason for that, and when the highest score is 
something like 0.60, there's a good reason for that as well. That is, 
queries that yield a small "best score" are queries
for which I would not expect good matches, and vice versa. So it does seem 
(again anecdotally) that the score has
*some* relevance. What are the experts' thoughts on this?


Donna L. Gresh
Services Research, Mathematical Sciences Department
IBM T.J. Watson Research Center
(914) 945-2472
http://www.research.ibm.com/people/g/donnagresh
gresh@us.ibm.com

--=_alternative 006696D685257336_=--