lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Chuck Williams" <ch...@manawiz.com>
Subject idf^2
Date Sun, 17 Oct 2004 00:22:12 GMT
Doug Cutting wrote:
> If someone can demonstrate that an alternate formulation produces
> superior results for most applications, then we should of course
change
> the default implementation.  But just noting that there's a factor
which
> is equal to idf^2 in each element of the sum does not do this.

I researched the idf^2 issue further and believe that empirical studies
have consistently concluded that one idf factor should be dropped.
Salton, the originator of the IR vector space model, decided to drop the
idf term on documents in order to avoid the squaring.  I hear he did
this after studying recall and precision for many variations of his
formula.  Here is a quote from his Trec-3 paper, which references the
same comment in his Trec-1 and Trec-2 papers.  Note the final sentence:

  "A well-known term weighting system following that prescription
assigns
   weight wik to term Tk in query Qi in proportion to the frequency of 
   occurrence of the term in Qi, and in inverse proportion to the number
   of documents to which the term is assigned.[14, 12] Such a weighting 
   system is known as a tf x idf (term frequency times inverse document
   frequency) weighting system. In practice the query lengths, and hence
   the number of non-zero term weights assigned to a query, varies
widely.
   To allow a meaningful final retrieval similarity, it is convenient to
use
   a length normalization factor as part of the term weighting formula.
A
   high-quality term weighting formula for wik, the weight of term Tk
   in query Qi is

     wik= [ log(fik) + 1.0) * log(N/nk) ] /
          sqrt(sum(j=1, t) [(log(fij) + 1.0) * log(N/nj)]^2)        (1)

   where fik is the occurrence frequency of Tk in Qi, N is the
collection
   size, and nk is the number of documents with term Tk assigned. The
factor
   log(N/nk) is an inverse collection frequency ("idf") factor which
   decreases as terms are used widely in a collection, and the
denominator
   in expression (1) is used for weight normalization. This particular
form
   will be called "ltc" weighting within this paper. 

   The weights assigned to terms in documents are much the same. In
practice,
   for both effectiveness and efficiency reasons the idf factor in the
   documents is dropped.[2, 1]"

Similarly, another successful scoring formula that has been extensively
tuned through empirical studies, OKAPI, uses idf linearly and not
quadratically.  I checked with some friends that are more expert in this
area than me, Edwin Cooper the founder and Chief Scientist at InQuira,
and his father Bill Cooper, a pioneer in IR and professor emeritus at
Berkeley.  Both of them report that squaring the idf term seems
"strange" and is not consistent with the best known scoring formulas.

At InQuira, we did extensive empirical tests for relevance in many
different domains.  Our situation was different as the product retrieves
specific passages so our model was much more complex (extracting
sentences or paragraphs from a large corpus that specifically answer a
natural language question -- e.g. try a question like "what are roth vs
regular iras" in the search box at www.bankofamerica.com).  However, we
did include document relevance factors including tfidf and did not
square the idf.  None of our testing indicated that would have been an
improvement, although we did not explicitly try it.

I think that Lucene relevancy would be improved at least a little by, as
Salton puts it, dropping the idf factor in document term weights.  This
is easy to work-around with a final sqrt in the idf computation in the
similarity, although that seems a hack.

I don't think is a high priority issue as there is an easy workaround,
and application-specific relevance tuning is likely to be much more
important than the relatively small variations in recall/precision
achieved by tweaking these formulas.  However, Lucene is a very good
search engine and it seems right that it would have a "best of class"
scoring formula out of the box.

Along those same lines, I think my other suggestion of changing the
final normalization in Hits to achieve absolute comparability across
different searches is more important.  It's harder to work around the
current normalization scheme, and using absolute score thresholds to
limit and/or segregate results is a great capability.  As stated in my
earlier posting I believe the standard vector space normalization, i.e.
the denominator of Salton's formula above, with additional factors to
accounts for boosts and length normalization would accomplish that and
be a significant improvement over what Hits does now in Lucene.

Chuck

> -----Original Message-----
> From: Doug Cutting [mailto:cutting@apache.org]
> Sent: Wednesday, October 13, 2004 9:25 AM
> To: Lucene Developers List
> Subject: Re: Contribution: better multi-field searching
> 
> Paul Elschot wrote:
> >>Did you see my IDF question at the bottom of the original note?  I'm
> >>really curious why the square of IDF is used for Term and Phrase
> >>queries, rather than just IDF.  It seems like it might be a bug?
> >
> > I missed that.
> > It has been discussed recently, but I don't remember the outcome,
> > perhaps some else?
> 
> This has indeed been discussed before.
> 
> Lucene computes a dot-product of a query vector and each document
> vector.  Weights in both vectors are normalized tf*idf, i.e.,
> (tf*idf)/length.  The dot product of vectors d and q is:
> 
>    score(d,q) =  sum over t of ( weight(t,q) * weight(t,d) )
> 
> Given this formulation, and the use of tf*idf weights, each component
of
> the sum has an idf^2 factor.  That's just the way it works with dot
> products of tf*idf/length vectors.  It's not a bug.  If folks don't
like
> it they can simply override Similarity.idf() to return sqrt(super()).
> 
> If someone can demonstrate that an alternate formulation produces
> superior results for most applications, then we should of course
change
> the default implementation.  But just noting that there's a factor
which
> is equal to idf^2 in each element of the sum does not do this.
> 
> Doug
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Mime
View raw message