lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Why not normalization?
Date Fri, 09 Jul 2010 07:40:09 GMT
> Thanks for your valuble comments. Yes I observed tha, once the number of
> terms of the goes up, fieldNorm value goes down correspondingly. I think,
> therefore there won't be any default due to the variation of total number
of
> terms in the document. Am I right?

With the current scoring model advanced statistics are not available. There
are currently some approaches to add BM25 support to Lucene, for what the
index format needs to be enhanced to contain more statistics (number of
terms per document, avg number of terms per document,...).

> On Thu, Jul 8, 2010 at 9:34 AM, Rebecca Watson <bec.watson@gmail.com>
> wrote:
> 
> > hi,
> >
> > > 1) Although Lucene uses tf to calculate scoring it seems to me that
> > > term frequency has not been normalized. Even if I index several
> > > documents, it does not normalize tf value. Therefore, since the
> > > total number of words in index documents are varied, can't there be
> > > a fault in Lucene's
> > scoring?
> >
> > tf = term frequency i.e. the number of times the term appears in the
> > document, while idf is inverse document frequency - is a measure of
> > how rare a term is, i.e. related to how many documents the term
> > appears in.
> >
> > if term1 occurs more frequently in a document i.e. tf is higher, you
> > want to weight the document higher when you search for term1
> >
> > but if term1 is a very frequent term, ie. in lots of documents, then
> > its probably not as important to an overall search (where we have
> > term1, term2 etc) so you want to downweight it (idf comes in)
> >
> > then the normalisations like length normalisation (allow for 'fair'
> > scoring across varied field length) come in too.
> >
> > the tf-idf scoring formula used by lucene is a  scoring method that's
> > been around a long long time... there are competing scoring metrics
> > but that's an IR thing and not an argument you want to start on the
> > lucene lists! :)
> >
> > these are IR ('information retrieval') concepts and you might want to
> > start by going to through the tf-idf scoring / some explanations for
> > this kind of scoring.
> >
> > http://en.wikipedia.org/wiki/Tf%E2%80%93idf
> > http://wiki.apache.org/lucene-java/InformationRetrieval
> >
> >
> > > 2) What is the formula to calculate this fieldNorm value?
> >
> > in terms of how lucene implements its tf-idf scoring - you can see here:
> > http://lucene.apache.org/java/3_0_2/scoring.html
> >
> > also, the lucene in action book is a really good book if you are
> > starting out with lucene (and will save you a lot of grief with
> > understanding lucene / setting up your application!), it covers all
> > the basics and then moves on to more advanced stuff and has lots of
> > code examples too:
> > http://www.manning.com/hatcher2/
> >
> > hope that helps,
> >
> > bec :)
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message