lucene-java-user mailing list archives

From Alex Murzaku <murz...@yahoo.com>
Subject Re: text format and scoring
Date Sat, 03 Aug 2002 21:13:26 GMT
Hi PA! How are things going?

It's an interesting question but I don't think Lucene
(as it is today) could change weights based on
semantics (either assigned by formatting tags or maybe
looked up in some dictionary like WordNet)...

Some time ago, Doug posted to this list the formula
for the score computation, which is:

  score_d = sum_t((tf_q * idf_t / norm_q) * (tf_d * idf_t / norm_d_t) * boost_t) * coord_q_d

  score_d   : score for document d
  sum_t     : sum for all terms t
  tf_q      : the square root of the frequency of t in the query
  tf_d      : the square root of the frequency of t in d
  idf_t     : log(numDocs/(docFreq_t+1)) + 1.0
  numDocs   : number of documents in index
  docFreq_t : number of documents containing t
  norm_q    : sqrt(sum_t((tf_q*idf_t)^2))
  norm_d_t  : square root of number of tokens in d in the same field as t
  boost_t   : the user-specified boost for term t
  coord_q_d : number of terms in both query and document / number of terms in query
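
To make the formula concrete, here is a rough Python sketch of it.
It is not Lucene code, only a transcription of the definitions above
under a few assumptions of mine: every term lives in a single field,
the function and parameter names (idf, score, doc_freqs, boosts) are
made up for illustration, and the idf is read as
log(numDocs/(docFreq_t+1)) + 1.0.

```python
import math
from collections import Counter


def idf(num_docs, doc_freq):
    # idf_t = log(numDocs / (docFreq_t + 1)) + 1.0
    return math.log(num_docs / (doc_freq + 1)) + 1.0


def score(query_terms, doc_tokens, num_docs, doc_freqs, boosts=None):
    """Score one document field against a query, following the formula above.

    query_terms: list of query terms (repeats count towards tf_q)
    doc_tokens:  list of tokens in the document field (assumed single field)
    doc_freqs:   hypothetical dict mapping term -> number of docs containing it
    boosts:      optional dict mapping term -> user-specified boost
    """
    boosts = boosts or {}
    q_counts = Counter(query_terms)
    d_counts = Counter(doc_tokens)
    if not q_counts or not doc_tokens:
        return 0.0

    # tf_q * idf_t for every query term
    q_weights = {
        t: math.sqrt(c) * idf(num_docs, doc_freqs.get(t, 0))
        for t, c in q_counts.items()
    }
    # norm_q = sqrt(sum_t((tf_q * idf_t)^2))
    norm_q = math.sqrt(sum(w * w for w in q_weights.values()))
    if norm_q == 0.0:
        return 0.0
    # norm_d_t = sqrt(number of tokens in the field)
    norm_d = math.sqrt(len(doc_tokens))

    total = 0.0
    matched = 0
    for t, q_weight in q_weights.items():
        if d_counts[t] == 0:
            continue  # term absent from the document contributes nothing
        matched += 1
        # tf_d * idf_t
        d_weight = math.sqrt(d_counts[t]) * idf(num_docs, doc_freqs.get(t, 0))
        total += (q_weight / norm_q) * (d_weight / norm_d) * boosts.get(t, 1.0)

    # coord_q_d = matched query terms / distinct query terms
    return total * matched / len(q_counts)
```

With a two-term query, a document containing both terms should
outscore one of the same length containing only one of them, since
both the sum and the coord factor grow.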

Apart from field length and explicit boosts, the only
things that count are the frequency of the terms within
the document and across documents.

A way to influence the final score might be to tweak
the real frequencies during indexing with some
parameters configured externally. Let's say if the
word is underlined then multiply its count by X. This
modified TF should influence the final score
accordingly.
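
As a sketch of what I mean (plain Python, not Lucene API): assume the
HTML parser hands over each token together with a flag saying whether
it was emphasized. The names weighted_term_frequencies and
emphasis_multiplier are hypothetical, and the multiplier plays the
role of X above.

```python
import math
from collections import defaultdict


def weighted_term_frequencies(tokens, emphasis_multiplier=3.0):
    """Count terms, letting each emphasized occurrence count X times.

    tokens: iterable of (term, emphasized) pairs, where emphasized is True
            for words that appeared inside an emphasis tag (an assumption
            about what the upstream HTML parser provides).
    """
    freqs = defaultdict(float)
    for term, emphasized in tokens:
        freqs[term] += emphasis_multiplier if emphasized else 1.0
    return dict(freqs)


def tf_d(freq):
    # tf_d is the square root of the (possibly tweaked) frequency
    return math.sqrt(freq)


tokens = [("lucene", True), ("lucene", False), ("index", False)]
freqs = weighted_term_frequencies(tokens)
```

Since tf_d is a square root, multiplying a count by X only raises
that term's contribution by a factor of sqrt(X), so X may need to be
larger than one would first guess.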

Just a thought...

Alex


--- petite_abeille <petite_abeille@mac.com> wrote:
> Hello,
> 
> I was wondering what would be a good way to
> incorporate text format
> information in Lucene word/document scoring. For
> example, when turning
> HTML into plain text for indexing purposes, a lot of
> potentially useful
> information is lost: e.g. tags like <b>, <strong>
> and so on could be 
> understood as conveying emphasis information about
> some words. If 
> somebody took the pain to "underline" some words,
> why throw it away? 
> Assuming there is some interesting meaning in a
> document format/layout, 
> and a way to understand it and weight it, how could
> one incorporate this 
> information into document scoring?
> 
> Thanks for any insights :-)
> 
> PA.
> 
> 
> --
> To unsubscribe, e-mail:  
> <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail:
> <mailto:lucene-user-help@jakarta.apache.org>
> 




