lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Doron Cohen <DOR...@il.ibm.com>
Subject Re: Re: Questions about Lucene scoring (was: Lucene 1.2 - scoring formula needed)
Date Tue, 12 Dec 2006 06:41:56 GMT
> Well it doesn't since there is not justification of why it is the
> way it is. Its like saying, here is that car with 5 weels... enjoy
driving.

> >  - I think the explanations there would also answer at least some of
your
> > questions.

I hoped it would answer *some* of the questions... (not all)

Let me try some more :-)

> 2a) Why does Lucene normalise with Cosine Normalisation for the
> query? In a range of different IR system variations (as shown in
> Chisholm and Kolda's paper in the table on page 8) queries where not
> normalised at all. Is there a good reason or perhaps any empirical
> evidence that would support this decision?

I think "why normalizing at all" is answered in
http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#formula_queryNorm
... "queryNorm(q) is a normalizing factor used to make scores between
queries comparable. This factor does not affect document ranking (since all
ranked documents are multiplied by the same factor), but rather just
attempts to make scores from different queries (or even different indexes)
comparable. This is a search time factor computed by the Similarity in
effect at search time. The default computation in DefaultSimilarity is:
queryNorm(q) = 1/sqrt(sumOfSquaredWeights)"

I think there were discussions on this normalization. Anyhow I am not the
expert to justify this.

> 2b) What is the motivation for the normalisation imposed on the
> documents (norm_d_t) which I have not seen before in any other
> system. Again, does anybody have pointers to literature on this?

If I understand what you are asking, the logic is that verrry long
documents could virtually contain the entire collection, and therefore
should be "punished" for being too long, otherwise they would have an
"unfair advantage" upon short documents. Google "search engine document
length normalization" for more on this, or see also
http://trec.nist.gov/pubs/trec10/papers/JuruAtTrec.pdf

> Any answer or partial answer on any of the questions would be
> greatly appreciated!

I hope this (although partial) helps at all,
Doron


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message