On Wednesday 20 April 2005 14:04, Barbara Krausz wrote:
> Hi,
> currently I'm writing my Bachelorthesis about Lucene. I searched for
> theoretical information for example about the IRmodel Lucene uses, but
> I couldn't find anything so I had to figure it out on my own.
> I think Lucene uses the vector space model with a variation of the
> cosine measure (cosine measure is described in "modern information
> retrieval"): Instead of a division by the length of the documentvector
> it divides the score by the fieldNorm. So, the scoring formula can be
> written as:
>
> score(d,q)=sum (i=1 to t) ( wid* wiq/(sqrt(m)*q) )
>
> t: number of distinct terms in the collection
> wid: weight of term i in document d
> wiq: weight of term i in the query
> m: total number of terms in the field (sqrt(m)=fieldNorm)
> q: length of queryvector q
There is also a coordination factor:
http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html
> The weight wid is the product of idf and tf. The weight wiq is just
> idf. So the only difference to the cosine measure is the use of sqrt(m)
> instead of the length of the documentvector, isn't it? Why? Is it too
> difficult to compute this length?
The square root is a variation on the power norms of the extended
boolean model. It helps to make each extra occurrence in the same document
less important.
The field length is computed under the hood by counting the indexed terms.
Its square root is stored as a single byte value in a special representation
with 3 bits mantissa and 5 bits exponent.
> Has anyone tried an index based on ngrams?
Nutch has bigrams for phrases with frequently occurring words.
Regards,
Paul Elschot

To unsubscribe, email: javauserunsubscribe@lucene.apache.org
For additional commands, email: javauserhelp@lucene.apache.org
