lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Rishabh Bajpai" <r_baj...@lycos.com>
Subject boosting of terms while retrieving results? |
Date Fri, 07 Feb 2003 14:21:24 GMT

in one of the previous responses, the score associated with each of the term was interpreted
as described below. 
Can someone tell me how do I achieve point (3) mentioned below-There is a boost factor which
allows you to boost certain terms at query time (e.g. you value matches in title field more
than the body field boost title field queries)? 
Is this referring to the ^ wild-card, that we can use in constructing the query? 
if no, then how do we acheive this?
if yes, then what i am not very clear on is - whether this has to necessarily be user-supplied
or can it be what we construct in the query-parser? (does the user have to enter "this is
lucene^ powered" or can it be internally (in the query parser)decided that what we want to
boost b4 we fetch the results?)

----------------------------------------------------------------
1. The more frequent the term in a collection, the lower its weight (significance).  As very
popular words don't distinguish one document from the other much, because they are present
in so many docs.
2. The more frequent a word in a single document, the higher the documents 'value' when the
query contains that word.  So the score goes up for frequent words in a document, esp. if
they are not frequent in other documents in the collection.
3. There is a boost factor which allows you to boost certain terms at query time (e.g. you
value matches in title field more than the body field boost title field queries).
4. Normalization factor normalizes things so that longer documents don't have advantage over
shorter ones.
5. You can boost fields at index time (you'll have to use the nightly build for that instead
of the 1.2 release to get this).
6. Also, a keyword in a shorter document is deemed more significant than in a longer one because
it represents a larger percentage of the document.
----------------------------------------------------------
On the official FAQ page of the Lucene site, a formula is listed as 
score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t) * coord_q_d
where:
  score_d   : score for document d
  sum_t     : sum for all terms t
  tf_q      : the square root of the frequency of t in the query
  tf_d      : the square root of the frequency of t in d
  idf_t     : log(numDocs/docFreq_t+1) + 1.0
  numDocs   : number of documents in index
  docFreq_t : number of documents containing t
  norm_q    : sqrt(sum_t((tf_q*idf_t)^2))
  norm_d_t  : square root of number of tokens in d in the same field as t
  boost_t    : the user-specified boost for term t
  coord_q_d  : number of terms in both query and document / number of terms in query


-rishabh




_____________________________________________________________
Get 25MB, POP3, Spam Filtering with LYCOS MAIL PLUS for $19.95/year.
http://login.mail.lycos.com/brandPage.shtml?pageId=plus&ref=lmtplus

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message