lucene-java-user mailing list archives

From Barbara Krausz
Subject Scoring, cosine measure
Date Wed, 20 Apr 2005 12:04:32 GMT
Currently I'm writing my bachelor's thesis about Lucene. I searched for 
theoretical information, for example about the IR model Lucene uses, but 
I couldn't find anything, so I had to figure it out on my own.
I think Lucene uses the vector space model with a variation of the 
cosine measure (the cosine measure is described in "Modern Information 
Retrieval"): instead of dividing by the length of the document vector, 
it divides the score by the fieldNorm. So the scoring formula can be 
written as:

score(d,q) = sum (i=1 to t) ( wid * wiq / (sqrt(m) * |q|) )

t: number of distinct terms in the collection
wid: weight of term i in document d
wiq: weight of term i in the query
m: total number of terms in the field (sqrt(m)=fieldNorm)
|q|: length of the query vector q

The weight wid is the product of idf and tf. The weight wiq is just 
idf. So the only difference from the cosine measure is the use of sqrt(m) 
instead of the length of the document vector, isn't it? Why? Is it too 
difficult to compute this length?
Has anyone tried an index based on n-grams?
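
To make the comparison between sqrt(m) and the document vector length concrete, here is a minimal Python sketch of the formula as I wrote it above, next to the plain cosine measure. It is not Lucene's code. The function names, the toy collection, the tf (square root of the raw frequency) and the idf (1 + ln(N / (df + 1))) are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def idf(term, docs):
    # Assumed smoothed idf: 1 + ln(N / (df + 1)).
    df = sum(1 for d in docs if term in d)
    return 1.0 + math.log(len(docs) / (df + 1))

def doc_weights(doc, docs):
    # wid = tf * idf, with tf assumed to be sqrt(raw term frequency).
    counts = Counter(doc)
    return {t: math.sqrt(c) * idf(t, docs) for t, c in counts.items()}

def query_weights(query, docs):
    # wiq = idf only, as stated in the message.
    return {t: idf(t, docs) for t in set(query)}

def norm(weights):
    # Euclidean length of a weight vector.
    return math.sqrt(sum(w * w for w in weights.values()))

def dot(wd, wq):
    # Only terms present in both the query and the document contribute.
    return sum(wd.get(t, 0.0) * w for t, w in wq.items())

def score_field_norm(query, doc, docs):
    # Variant from the message: divide by sqrt(m) * |q|,
    # where m is the number of terms in the (non-empty) field.
    wd, wq = doc_weights(doc, docs), query_weights(query, docs)
    return dot(wd, wq) / (math.sqrt(len(doc)) * norm(wq))

def score_cosine(query, doc, docs):
    # Plain cosine measure: divide by |d| * |q|.
    wd, wq = doc_weights(doc, docs), query_weights(query, docs)
    return dot(wd, wq) / (norm(wd) * norm(wq))

# Hypothetical toy collection; each document is a list of tokens.
docs = [
    ["lucene", "scoring", "cosine"],
    ["lucene", "index"],
    ["vector", "space", "model", "lucene"],
]

query = ["lucene", "cosine"]
for d in docs:
    print(d, score_field_norm(query, d, docs), score_cosine(query, d, docs))
```

In this sketch the two scores share the same numerator and differ only in the document-side divisor. One difference it makes visible: |d| depends on the idf of every term in the document, so it changes whenever the collection changes, while sqrt(m) depends only on the document itself.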


PS: I will probably have more questions about Lucene in the coming weeks :-)
