lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From manjula wijewickrema <manjul...@gmail.com>
Subject Re: Why not normalization?
Date Fri, 09 Jul 2010 10:30:45 GMT
Thanx

On Fri, Jul 9, 2010 at 1:10 PM, Uwe Schindler <uwe@thetaphi.de> wrote:

> > Thanks for your valuble comments. Yes I observed tha, once the number of
> > terms of the goes up, fieldNorm value goes down correspondingly. I think,
> > therefore there won't be any default due to the variation of total number
> of
> > terms in the document. Am I right?
>
> With the current scoring model advanced statistics are not available. There
> are currently some approaches to add BM25 support to Lucene, for what the
> index format needs to be enhanced to contain more statistics (number of
> terms per document, avg number of terms per document,...).
>
> > On Thu, Jul 8, 2010 at 9:34 AM, Rebecca Watson <bec.watson@gmail.com>
> > wrote:
> >
> > > hi,
> > >
> > > > 1) Although Lucene uses tf to calculate scoring it seems to me that
> > > > term frequency has not been normalized. Even if I index several
> > > > documents, it does not normalize tf value. Therefore, since the
> > > > total number of words in index documents are varied, can't there be
> > > > a fault in Lucene's
> > > scoring?
> > >
> > > tf = term frequency i.e. the number of times the term appears in the
> > > document, while idf is inverse document frequency - is a measure of
> > > how rare a term is, i.e. related to how many documents the term
> > > appears in.
> > >
> > > if term1 occurs more frequently in a document i.e. tf is higher, you
> > > want to weight the document higher when you search for term1
> > >
> > > but if term1 is a very frequent term, ie. in lots of documents, then
> > > its probably not as important to an overall search (where we have
> > > term1, term2 etc) so you want to downweight it (idf comes in)
> > >
> > > then the normalisations like length normalisation (allow for 'fair'
> > > scoring across varied field length) come in too.
> > >
> > > the tf-idf scoring formula used by lucene is a  scoring method that's
> > > been around a long long time... there are competing scoring metrics
> > > but that's an IR thing and not an argument you want to start on the
> > > lucene lists! :)
> > >
> > > these are IR ('information retrieval') concepts and you might want to
> > > start by going to through the tf-idf scoring / some explanations for
> > > this kind of scoring.
> > >
> > > http://en.wikipedia.org/wiki/Tf%E2%80%93idf
> > > http://wiki.apache.org/lucene-java/InformationRetrieval
> > >
> > >
> > > > 2) What is the formula to calculate this fieldNorm value?
> > >
> > > in terms of how lucene implements its tf-idf scoring - you can see
> here:
> > > http://lucene.apache.org/java/3_0_2/scoring.html
> > >
> > > also, the lucene in action book is a really good book if you are
> > > starting out with lucene (and will save you a lot of grief with
> > > understanding lucene / setting up your application!), it covers all
> > > the basics and then moves on to more advanced stuff and has lots of
> > > code examples too:
> > > http://www.manning.com/hatcher2/
> > >
> > > hope that helps,
> > >
> > > bec :)
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message