Karsten Konrad wrote:
> Now hell would be the place for me where I would have to prove that Lucene's ranking
is
> exactly equivalent to some transformation of vector space and then using the *cosine*
for the
> ranking. Can't be really, as Lucene sometimes returns results > 1.0 and only some
ruthless
> normalisation keeps it within 0.0 to 1.0. In other words, there still are some rough
corners
> in Lucene where a good theorist could find some work.
One problem with computing the theoretically correct normalization is
that it changes each time the index is modified, and must be recomputed
for every document. This is because it includes IDFs, and, when the
index changes, IDFs change. That's when you normalize tf/idf vectors to
the unit sphere, i.e., norm=sqrt(sum_t(weight(t)^2)).
But research has also shown that this mathematically correct
normalization is not the best, e.g. Singhal et. al.'s work on pivoted
normalization:
http://citeseer.ist.psu.edu/singhal96pivoted.html
More generally, this shows that information retreival is not a
theoretical field, but rather a heuristic one. (Although someone may
flame me for saying that...) The vector space model is just a useful
analogy, not a verifiable theory of document meaning. It suggests some
formulae which can be improved through inspiration and experimentation.
Doug

To unsubscribe, email: luceneuserunsubscribe@jakarta.apache.org
For additional commands, email: luceneuserhelp@jakarta.apache.org
