lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Doug Cutting <cutt...@lucene.com>
Subject Re: AW: AW: Real Boolean Model in Lucene?
Date Mon, 01 Dec 2003 20:27:25 GMT
Karsten Konrad wrote:
> Now hell would be the place for me where I would have to prove that Lucene's ranking
is 
> exactly equivalent to some transformation of vector space and then using the *cosine*
for the 
> ranking. Can't be really, as Lucene sometimes returns results > 1.0 and only some
ruthless
> normalisation keeps it within 0.0 to 1.0. In other words, there still are some rough
corners
> in Lucene where a good theorist could find some work.

One problem with computing the theoretically correct normalization is 
that it changes each time the index is modified, and must be recomputed 
for every document.  This is because it includes IDFs, and, when the 
index changes, IDFs change.  That's when you normalize tf/idf vectors to 
the unit sphere, i.e., norm=sqrt(sum_t(weight(t)^2)).

But research has also shown that this mathematically correct 
normalization is not the best, e.g. Singhal et. al.'s work on pivoted 
normalization:

   http://citeseer.ist.psu.edu/singhal96pivoted.html

More generally, this shows that information retreival is not a 
theoretical field, but rather a heuristic one.  (Although someone may 
flame me for saying that...)  The vector space model is just a useful 
analogy, not a verifiable theory of document meaning.  It suggests some 
formulae which can be improved through inspiration and experimentation.

Doug


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message