lucene-general mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: Subjects DB Matching
Date Mon, 06 Oct 2008 23:37:24 GMT

Mauro: I assume you are working with the "Lucene-Java" package to build 
your software?  (As opposed to one of the other subprojects like Solr, 
Mahout, or Tika, none of which are ruled out by your problem 
description.)  If so, you will probably get more feedback using the 
java-user@lucene mailing list in the future.

In general the problem you are describing isn't easily solvable.  In order 
to determine a good "minimum cut off" score you have to be able to 
normalize your scores in a meaningful way, and to do that you have to be 
able to define what the "best" (or baseline) possible score for any query 
is.  This isn't something Lucene can tell you for an arbitrary query, but 
it can be determined in special cases.  For example: for a simple 
TermQuery you can figure it out based on the idf and the document with 
the highest tf; for "document similarity" type problems like the ones 
MoreLikeThis solves, you can get a good baseline by finding the score of 
the document used to generate the MoreLikeThisQuery (but that requires 
that it be indexed).
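
To make the normalization step concrete, here is a rough sketch in plain 
Java.  It deliberately uses no Lucene classes: it assumes you have already 
collected the raw hit scores and a baseline score yourself (for instance, 
the score the source document gets against its own MoreLikeThisQuery).  
The class name, method names, and the 0.5 cutoff are all made up for 
illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class ScoreCutoff {

    /** Express a raw score as a fraction of the baseline score for the same query. */
    public static double normalize(double rawScore, double baselineScore) {
        if (baselineScore <= 0.0) {
            throw new IllegalArgumentException("baseline score must be positive");
        }
        return rawScore / baselineScore;
    }

    /** Return the positions of the hits whose normalized score reaches minFraction. */
    public static List<Integer> aboveCutoff(double[] rawScores,
                                            double baselineScore,
                                            double minFraction) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < rawScores.length; i++) {
            if (normalize(rawScores[i], baselineScore) >= minFraction) {
                kept.add(i);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: 4.0 stands in for the baseline score.
        double baseline = 4.0;
        double[] raw = {4.0, 3.0, 1.0};
        System.out.println(aboveCutoff(raw, baseline, 0.5)); // prints [0, 1]
    }
}
```

A cutoff expressed this way is only comparable across queries if the 
baseline really is the best achievable score for each query, which is 
exactly the part that is hard in the general case.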


-Hoss

