Hi,
I'm currently writing my bachelor's thesis about Lucene. I searched for
theoretical information, for example about the IR model Lucene uses, but
I couldn't find anything, so I had to figure it out on my own.
I think Lucene uses the vector space model with a variation of the
cosine measure (the cosine measure is described in "Modern Information
Retrieval"): instead of dividing by the length of the document vector,
it divides the score by the fieldNorm. So the scoring formula can be
written as:
score(d,q) = sum (i=1 to t) ( wid * wiq / (sqrt(m) * |q|) )
t: number of distinct terms in the collection
wid: weight of term i in document d
wiq: weight of term i in the query
m: total number of terms in the field (sqrt(m) = fieldNorm)
|q|: length of the query vector q
The weight wid is the product of idf and tf. The weight wiq is just
idf. So the only difference from the cosine measure is the use of sqrt(m)
instead of the length of the document vector, isn't it? Why? Is it too
difficult to compute this length?
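To make the comparison concrete, here is a minimal Python sketch of the
two normalizations as I understand them. It is not Lucene code: the
function names, the toy idf values, and the example document are all
made up for illustration, and the weights follow my assumptions above
(wid = tf * idf, wiq = idf).

```python
import math


def doc_weights(term_counts, idf):
    """wid = tf * idf for every term in the document field (assumed weighting)."""
    return {term: tf * idf.get(term, 0.0) for term, tf in term_counts.items()}


def query_weights(query_terms, idf):
    """wiq = idf for every distinct query term (assumed weighting)."""
    return {term: idf.get(term, 0.0) for term in set(query_terms)}


def vector_length(weights):
    """Euclidean length of a weight vector."""
    return math.sqrt(sum(w * w for w in weights.values()))


def dot_product(w_d, w_q):
    """Sum of wid * wiq over the terms that occur in the query."""
    return sum(w_d.get(term, 0.0) * w for term, w in w_q.items())


def score_field_norm(term_counts, query_terms, idf):
    """Formula from this post: normalize by sqrt(m) * |q|."""
    w_d = doc_weights(term_counts, idf)
    w_q = query_weights(query_terms, idf)
    m = sum(term_counts.values())  # total number of terms in the field
    q_len = vector_length(w_q)
    if m == 0 or q_len == 0:
        return 0.0
    return dot_product(w_d, w_q) / (math.sqrt(m) * q_len)


def score_cosine(term_counts, query_terms, idf):
    """Textbook cosine measure: normalize by |d| * |q|."""
    w_d = doc_weights(term_counts, idf)
    w_q = query_weights(query_terms, idf)
    d_len = vector_length(w_d)
    q_len = vector_length(w_q)
    if d_len == 0 or q_len == 0:
        return 0.0
    return dot_product(w_d, w_q) / (d_len * q_len)


# Toy data, invented for this example.
idf = {"lucene": 2.0, "index": 1.0, "the": 0.1}
doc = {"lucene": 2, "index": 1, "the": 6}  # term -> tf in the field
query = ["lucene", "index"]

print("fieldNorm variant:", score_field_norm(doc, query, idf))
print("cosine measure:   ", score_cosine(doc, query, idf))
```

In this sketch the only difference between the two functions is the
document-side normalizer: sqrt(m) needs only the term count of the
field, while |d| needs the tf * idf weight of every term in the field.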
Has anyone tried an index based on n-grams?
Thanks
Barbara
PS: I will probably have more questions about Lucene in the coming weeks :)

