I looked at the scoring mechanism more closely again. Some of you may
remember that there was a discussion about this recently. There was
especially some argument about the theoretical justification of
the current scoring algorithm. Chuck proposed that, at least from
a theoretical perspective, it would be good to normalize the
document vector and thus implement cosine similarity.
Well, we found out that this cannot be implemented efficiently.
However, I have now found out that the current algorithm has a very
intuitive theoretical justification. Some of you may already know
that, but I never looked into it that deeply.
Both the query and all documents are represented as vectors in term
vector space. The current scoring is simply the dot product of the
query with a document normalized by the length of the query vector
(if we skip the additional coord factor). Geometrically speaking,
this is the distance of the document vector from the hyperplane
through the origin that is orthogonal to the query vector. See the
attached figure.
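To make the geometric argument concrete, here is a minimal sketch
(not our actual scoring code, just an illustration in a hypothetical
2-term vector space): the score is the dot product of query and
document divided by the length of the query vector, and this value
equals the signed distance of the document point from the hyperplane
through the origin orthogonal to the query.

```python
import math

def score(query_vec, doc_vec):
    # Dot product of query and document, normalized by the length
    # of the query vector (the additional coord factor is omitted).
    dot = sum(q * d for q, d in zip(query_vec, doc_vec))
    query_len = math.sqrt(sum(q * q for q in query_vec))
    return dot / query_len

# Illustrative weights in a 2-term vector space (made-up numbers).
q = [1.0, 1.0]   # query term weights
d = [3.0, 1.0]   # document term weights

s = score(q, d)

# Geometric check: for the hyperplane {x : q . x = 0} through the
# origin, the signed distance of point d is (q . d) / |q|, which is
# exactly the score computed above.

# Contrast with cosine similarity, which additionally divides by the
# document vector's length (the normalization Chuck proposed):
doc_len = math.sqrt(sum(x * x for x in d))
cosine = s / doc_len
```

Note that the only difference from cosine similarity is the missing
division by the document length, which is why longer documents can
score higher under the current scheme.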
Christoph
