Christoph, thanks for reading through my long postings and sharing your
thoughts. I had one comment in the first proposal email stating a
conclusion to move away from cosine normalization, but I didn't share
the reasons for this conclusion. Please let me know if you agree with
the following analysis.
I believe the central issue is the term sum(t) weight(t,d)^2, as Doug
pointed out. There seem to be two possible definitions for this term:
a) The sum extends over all the terms in the document
b) The sum just extends over the terms in the query
I don't know which of these definitions has been selected by systems
that implement cosine normalization, but my reading of Salton's papers
indicates that he intends definition a. I believe both definitions have
a fatal flaw:
a) For short queries the final scores are very small, not
subjectively reasonable. Also, as Doug pointed out, the cost of
maintaining this calculation under incremental changes to the collection
is prohibitive, and similarly I don't believe this can be computed
efficiently at query time.
b) Differentiation of scores for short queries is lost. E.g., all
matches to a one word query will have the score 1.0. All
differentiation based on tf and length norm is lost. This is because,
unlike what I've proposed, the cosine normalization uses a different
normalization factor for each result. Thus it changes the score ratios
for all hits, not just the top score. For a one word query, the vectors
have length one, whence
q.d = q d, and so q.d / q d = 1.0 for all results. In contrast,
my proposal will set the top score for a match to a one word query to
1.0, but all other scores will vary as they do now based on all scoring
factors.
So, cosine normalization looks like a loser to me. I'm not an expert in
this and may have the wrong analysis here. Do you see flaws in the
above?
I continue to believe this is an important problem and am very
appreciative that some others are digging into the issue. My specific
proposal has the benefit of not changing the score relationships
relative to Lucene today and so is good from a backwardcompatibility
standpoint. It is clearly better than the current normalization in
Hits. I think that setting the top score to its (net boost) / (total
boost) is not too bad, although as indicated in the proposal this could
be further refined in an attempt to also use other factors (tf, idf
and/or length norm) in the setting of the top score. I'm not sure
whether nor not using these additional factors in the normalization
would be a good thing and would appreciate other thoughts. (Remember
that all factors will be used in the scoring  the only question is
which are important in setting the normalized top score.)
I don't see any way to address this issue through subclassing  fixing
it seems to require modifying Lucene source. I'd rather not diverge
from Lucene source, especially in so many fundamental classes, and so
would like to see the changes incorporated back into Lucene. Is that
likely if I make the changes?
Thanks,
Chuck
> Original Message
> From: Christoph Goller [mailto:goller@detegosoftware.de]
> Sent: Thursday, October 21, 2004 9:47 AM
> To: Lucene Developers List
> Subject: Re: idf and explain(), was Re: Search and Scoring
>
> Chuck wrote:
> > That's a good point on how the standard vector space inner product
> > similarity measure does imply that the idf is squared relative to
the
> > document tf. Even having been aware of this formula for a long
time,
> > this particular implication never occurred to me.
>
> The same holds for me :)
>
> Chuck wrote:
> > I researched the idf2 issue further and believe that empirical
studies
> > have consistently concluded that one idf factor should be dropped.
> > Salton, the originator of the IR vector space model, decided to
drop
> the
> > idf term on documents in order to avoid the squaring. I hear he
did
> > this after studying recall and precision for many variations of
his
> > formula.
>
> The cosinedistance and dotproduct motivations should not be a
dogma.
> One
> can question the idea of using the same representation (as vector of
> terms)
> for a query and a document. The intuition behind not squaring idf
seems
> equally wellfound as the dotproduct/cosinedistance motivation.
> However, I also question empirical results here, since the right
scoring
> of documents is also very subjective :)
>
> Chuck wrote:
> > Regarding normalization, the normalization in Hits does not have
very
> > nice properties. Due to the > 1.0 threshold check, it loses
> > information, and it arbitrarily defines the highest scoring result
in
> > any list that generates scores above 1.0 as a perfect match. It
would
> > be nice if score values were meaningful independent of searches,
e.g.,
> > if "0.6" meant the same quality of retrieval independent of what
> search
> > was done. This would allow, for example, sites to use a a simple
> > quality threshold to only show results that were "good enough".
At my
> > last company (I was President and head of engineering for
InQuira), we
> > found this to be important to many customers.
> >
> > The standard vector space similarity measure includes
normalization by
> > the product of the norms of the vectors, i.e.:
> >
> > score(d,q) = sum over t of ( weight(t,q) * weight(t,d) ) /
> > sqrt [ (sum(t) weight(t,q)^2) * (sum(t)
> weight(t,d)^2) ]
> >
> > This makes the score a cosine, which since the values are all
positive,
> > forces it to be in [0, 1]. The sumOfSquares() normalization in
Lucene
> > does not fully implement this. Is there a specific reason for
that?
>
> You are right. The normalization on the documents contribution is
> missing
> and there are the additional coordfactors. The current
implementation
> is a
> mixture of dotproduct and cosinedistance. Therefore, the
additional
> normalization in Hits is necessary if one wants to avoid scores >
1.0.
>
> Doug wrote:
> > The quantity 'sum(t) weight(t,d)2' must be recomputed for each
> document each
> > time a document is added to the collection, since 'weight(t,d)' is
> dependent
> > on global term statistics. This is prohibitivly expensive.
>
> weight(t,d) is currently computed on the fly in every scorer.
> BooleanScorer and
> ConjunctionScorer could collect the document normalization, and in
the
> end
> the searcher could apply normalization. All Scorers would have to be
> extended
> to supply document normalization. Does this seem reasonable?
>
> Chuck wrote:
> > To illustrate the problem better normalization is intended to
address,
> > in my current application for BooleanQuery's of two terms, I
> frequently
> > get a top score of 1.0 when only one of the terms is matched. 1.0
> > should indicate a "perfect match". I'd like set my UI up to
present
> the
> > results differently depending on how good the different results
are
> > (e.g., showing a visual indication of result quality, dropping the
> > really bad results entirely, and segregating the good results from
> other
> > only vaguely relevant results). To build this kind of
"intelligence"
> > into the UI, I certainly need to know whether my top result
matched
> all
> > query terms, or only half of them. I'd like to have the score
tell me
> > directly how good the matches are. The current normalization does
not
> > achieve this.
> >
> > The intent of a new normalization scheme is to preserve the
current
> > scoring in the following sense: the ratio of the nth result's
score
> to
> > the best result's score remains the same. Therefore, the only
> question
> > is what normalization factor to apply to all scores. This reduces
to
> a
> > very specific question that determines the entire normalization 
> what
> > should the score of the best matching result be?
>
> I would prefer your first idea with the cosine normalization, if an
> efficient
> implementation is possible. As stated above, I currently think it is
> possible.
>
> Christoph
>
>
>
>
>
>
>
>
>
>

> To unsubscribe, email: lucenedevunsubscribe@jakarta.apache.org
> For additional commands, email: lucenedevhelp@jakarta.apache.org

To unsubscribe, email: lucenedevunsubscribe@jakarta.apache.org
For additional commands, email: lucenedevhelp@jakarta.apache.org
