Note that the dot product in the vector space world is heavily associated with the concept
of correlation coefficient in statistics:
"A correlation coefficient is a number between -1 and 1 which measures the degree to which
two variables are linearly related. If there is perfect linear relationship with positive
slope between the two variables, we have a correlation coefficient of 1; if there is positive
correlation, whenever one variable has a high (low) value, so does the other. If there is
a perfect linear relationship with negative slope between the two variables, we have a correlation
coefficient of -1; if there is negative correlation, whenever one variable has a high (low)
value, the other has a low (high) value. A correlation coefficient of 0 means that there is
no linear relationship between the variables.
There are a number of different correlation coefficients that might be appropriate depending
on the kinds of variables being studied."
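To make the connection to the dot product concrete, here is a minimal sketch (plain Python, no external libraries) showing that the Pearson correlation coefficient is just the dot product of the two mean-centered variables divided by the product of their lengths, i.e. the cosine of the angle between them:

```python
import math

def pearson(x, y):
    """Correlation coefficient: the cosine of the angle between
    the mean-centered vectors x and y."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    xc = [v - mx for v in x]
    yc = [v - my for v in y]
    dot = sum(a * b for a, b in zip(xc, yc))
    norm = math.sqrt(sum(a * a for a in xc)) * math.sqrt(sum(b * b for b in yc))
    return dot / norm

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # perfect positive slope: 1.0
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))  # perfect negative slope: -1.0
```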
Yet another definition: "The correlation coefficient is a quantity that gives the quality
of a least squares fitting to the original data." Least squares fitting is also used in the
k-nearest neighbor algorithms for other types of classification and similarity/relevance calculations.
From the point of view of probabilistic models of information retrieval, the dot product of TF-IDF
weights is equivalent to a Bayesian inference of the probability of a document being relevant
given the query, with no prior (knowledge) of relevancy, assuming (very naively) that the words
are independent from each other.
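For illustration, a minimal sketch of dot-product scoring over TF-IDF weights, where each query term contributes independently (the weighting here is the textbook tf x idf, not any particular engine's exact formula; the toy documents are invented):

```python
import math
from collections import Counter

docs = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog jumps",
]
N = len(docs)
tokenized = [d.split() for d in docs]

def idf(term):
    """Inverse document frequency: rare terms get higher weight."""
    df = sum(1 for toks in tokenized if term in toks)
    return math.log(N / df) if df else 0.0

def tfidf_vector(tokens):
    tf = Counter(tokens)
    return {t: c * idf(t) for t, c in tf.items()}

def score(query, doc_tokens):
    """Dot product of TF-IDF weights; every query term contributes
    independently -- the naive term-independence assumption."""
    qv = tfidf_vector(query.split())
    dv = tfidf_vector(doc_tokens)
    return sum(w * dv.get(t, 0.0) for t, w in qv.items())

for i, toks in enumerate(tokenized):
    print(i, score("quick dog", toks))
```

Note that a term occurring in every document gets idf = log(1) = 0 and thus contributes nothing to any score.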
A very interesting presentation about formal models for IR including most recent Language
(generation) models is exposed at http://www.sis.pitt.edu/~erasmus/week4.ppt
More information about different models and how they relate can be found in "Modern Information
Retrieval": http://www.sims.berkeley.edu/~hearst/irbook/chapters/chap2.html
The main problem that I have with the vector space and probabilistic IR models and algorithms
is that they all assume that there is a linear relationship between the query and the document
(as in how a document-to-document distance or similarity is calculated) and that words are
independent and follow a jointly normal distribution. I find it interesting how in statistics,
however, people have been able to work around these assumptions and come up with things such
as the Spearman rank correlation coefficient.
"The Spearman rank correlation coefficient is one example of a correlation coefficient. It
is usually calculated on occasions when it is not convenient, economic, or even possible to
give actual values to variables, but only to assign a rank order to instances of each variable.
It may also be a better indicator that a relationship exists between two variables when the
relationship is nonlinear.
Commonly used procedures, based on the Pearson's Product Moment Correlation Coefficient, for
making inferences about the population correlation coefficient make the implicit assumption
that the two variables are jointly normally distributed. When this assumption is not justified,
a nonparametric measure such as the Spearman Rank Correlation Coefficient might be more appropriate"
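A small sketch of the idea, assuming no ties require fractional ranks beyond simple averaging: Spearman's coefficient is just the Pearson correlation applied to the rank positions, so a monotone but nonlinear relationship still scores a perfect 1.0:

```python
def ranks(values):
    """Assign average 1-based ranks, handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx = sum(rx) / n
    my = sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

# Cubing is nonlinear but monotone, so the rank correlation is perfect:
x = [1, 2, 3, 4, 5]
print(spearman(x, [v ** 3 for v in x]))  # 1.0
```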
This also seems like a good starting point for calculating merged relevance rankings when exact
values from multiple ranking systems/algorithms cannot be obtained or are not compatible, but ranking
order is available.
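One simple way to merge on rank order alone is a Borda-count-style fusion by average position; this is just a sketch of the idea (the two "engine" lists are invented for illustration, and ties are broken arbitrarily by document id):

```python
def merge_by_rank(rankings):
    """Fuse several best-first ranked lists by average rank position.
    Only rank order is needed; the engines' raw scores never enter."""
    all_docs = set().union(*map(set, rankings))

    def avg_rank(doc):
        # A document absent from a list gets the worst possible rank there.
        return sum(r.index(doc) if doc in r else len(r)
                   for r in rankings) / len(rankings)

    # Tie-break on the doc id itself to keep the output deterministic.
    return sorted(all_docs, key=lambda doc: (avg_rank(doc), doc))

engine_a = ["d1", "d3", "d2", "d4"]
engine_b = ["d3", "d1", "d4", "d2"]
print(merge_by_rank([engine_a, engine_b]))
```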
Sorry for this long email. Just my 2 cents.
Joaquin Delgado, PhD
CTO, TripleHop Technologies, Inc.
________________________________
From: Christoph Goller [mailto:goller@apache.org]
Sent: Sun 10/31/2004 11:00 AM
To: Lucene Developers List
Subject: About Hit Scoring
I looked at the scoring mechanism more closely again. Some of you may
remember that there was a discussion about this recently. There was
especially some argument about the theoretical justification of
the current scoring algorithm. Chuck proposed that at least from
a theoretical perspective it would be good to apply a normalization
on the document vector and thus implement the cosine similarity.
Well, we found out that this cannot be implemented efficiently.
However, I now found out that the current algorithm has a very
intuitive theoretical justification. Some of you may already know
that, but I never looked into it that deeply.
Both the query and all documents are represented as vectors in term
vector space. The current scoring is simply the dot product of the
query with a document normalized by the length of the query vector
(if we skip the additional coord factor). Geometrically speaking this
is the distance of the document vector from the hyperplane through
the origin which is orthogonal to the query vector. See attached
figure.
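The geometric claim is easy to check numerically; a minimal sketch in a 2-term vector space (the vectors are invented for illustration, and the coord factor is skipped as above):

```python
import math

def score(query, doc):
    """Dot product normalized by the query vector's length: the
    projection of doc onto the query direction, i.e. the signed
    distance of the document vector's endpoint from the hyperplane
    through the origin orthogonal to the query vector."""
    dot = sum(q * d for q, d in zip(query, doc))
    qlen = math.sqrt(sum(q * q for q in query))
    return dot / qlen

q = [3.0, 4.0]       # |q| = 5
d1 = [6.0, 8.0]      # parallel to q, length 10
d2 = [-4.0, 3.0]     # orthogonal to q

print(score(q, d1))  # 10.0: the whole length of d1 projects onto q
print(score(q, d2))  # 0.0: lies on the hyperplane, zero score
```

Note that, unlike the cosine similarity, this score grows with the document vector's length, which is exactly why no document normalization is needed.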
Christoph
