"Karl Koch" wrote:
> For the documents Lucene employs
> its norm_d_t which is explained as:
>
> norm_d_t : square root of number of tokens in d in the same field as t
Actually (by default) it is:
1 / sqrt(#tokens in d with same field as t)
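As a rough illustration of that default (plain Python, not Lucene code; the function name is made up for this example, and the handling of an empty field is an assumption of the sketch):

```python
import math

def field_length_norm(num_tokens):
    """Return 1 / sqrt(number of tokens in the field), as described above."""
    if num_tokens <= 0:
        # Assumption for this sketch only: an empty field gets a norm of 0.
        return 0.0
    return 1.0 / math.sqrt(num_tokens)

# The norm shrinks as the field gets longer.
for n in (1, 4, 100, 10000):
    print(n, field_length_norm(n))
```

So a field with 4 tokens gets a norm of 0.5, and a longer field always gets a smaller norm than a shorter one.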
> basically just the square root of the number of unique terms in the
> document (since I do search over all fields always). I would have
> expected cosine normalisation here...
>
> The paper you provided uses document normalisation in the following way:
>
> norm = 1 / sqrt(0.8*avgDocLength + 0.2*(# of unique terms in d))
>
> I am not sure how this relates to norm_d_t.
That system is less "field oriented" than Lucene, so you could say the
normalization there goes over all the fields.
The {0.8,0.2} values are tunable parameters that control how aggressive
this normalization is.
If you used {0,1} there you would get
1 / sqrt(#unique terms in d)
which would be similar to Lucene's
1 / sqrt(#tokens in d with same field as t)
However (in that system) that would punish long documents too much and
boost trivially short documents too much, which is why the
{0.8,0.2} were introduced there.
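To make the effect of the two parameters concrete, here is a small sketch in plain Python. The function name, the parameter names, and the document sizes (10 and 1000 unique terms, average length 200) are made up for illustration, and it assumes avgDocLength is measured in the same unit as the number of unique terms:

```python
import math

def pivoted_norm(unique_terms, avg_doc_length, avg_weight=0.8, len_weight=0.2):
    """norm = 1 / sqrt(avg_weight*avgDocLength + len_weight*(# unique terms in d))"""
    return 1.0 / math.sqrt(avg_weight * avg_doc_length + len_weight * unique_terms)

avg_len = 200            # hypothetical average document length
short_doc, long_doc = 10, 1000   # hypothetical unique-term counts

# {0,1}: plain 1/sqrt(#unique terms), no dependence on the average length.
ratio_plain = pivoted_norm(short_doc, avg_len, 0.0, 1.0) / pivoted_norm(long_doc, avg_len, 0.0, 1.0)

# {0.8,0.2}: most of the weight goes to the collection average.
ratio_pivoted = pivoted_norm(short_doc, avg_len) / pivoted_norm(long_doc, avg_len)

print("short/long norm ratio with {0,1}:    ", ratio_plain)
print("short/long norm ratio with {0.8,0.2}:", ratio_pivoted)
```

With these made-up numbers the short document's norm is about 10 times the long document's under {0,1}, but only about 1.5 times under {0.8,0.2}, which is the softening described above.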