lucene-java-user mailing list archives

From manjula wijewickrema <manjul...@gmail.com>
Subject Re: Why not normalization?
Date Fri, 09 Jul 2010 06:50:37 GMT
Hi Rebecca,

Thanks for your valuable comments. Yes, I observed that once the number of
terms in the document goes up, the fieldNorm value goes down correspondingly.
I think, therefore, there won't be any fault due to the variation in the total
number of terms in the document. Am I right?
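
For reference, here is a minimal sketch of why that happens. It assumes the
DefaultSimilarity behaviour described in the Lucene 3.0.x scoring
documentation, where lengthNorm = 1/sqrt(number of terms in the field) and
tf = sqrt(freq). The class and method names are made up for illustration, and
the code does not call Lucene itself:

```java
// Illustrative sketch only: mirrors the formulas documented for Lucene 3.0.x
// DefaultSimilarity, but does not use any Lucene classes.
// FieldNormSketch, lengthNorm and tf are hypothetical names.
final class FieldNormSketch {

    // Length normalisation: shrinks as the field gets longer.
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Term-frequency factor: grows with the raw count, but sub-linearly.
    static float tf(int freq) {
        return (float) Math.sqrt(freq);
    }

    public static void main(String[] args) {
        int[] fieldLengths = {4, 16, 100, 400};
        int freq = 4; // same raw term count in every field

        for (int numTerms : fieldLengths) {
            float norm = lengthNorm(numTerms);
            // tf * lengthNorm is only part of the full score (idf, boosts,
            // queryNorm and coord are left out here).
            System.out.printf("terms=%d lengthNorm=%.4f tf*lengthNorm=%.4f%n",
                    numTerms, norm, tf(freq) * norm);
        }
    }
}
```

Note that the fieldNorm reported by explain() also folds in any document and
field boosts and is stored in a single byte, so the real values are rounded
more coarsely than what this sketch prints.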

Manjula.

On Thu, Jul 8, 2010 at 9:34 AM, Rebecca Watson <bec.watson@gmail.com> wrote:

> hi,
>
> > 1) Although Lucene uses tf to calculate scoring it seems to me that term
> > frequency has not been normalized. Even if I index several documents, it
> > does not normalize tf value. Therefore, since the total number of words
> > in index documents are varied, can't there be a fault in Lucene's
> scoring?
>
> tf = term frequency, i.e. the number of times the term appears in the
> document, while idf = inverse document frequency, a measure of how rare a
> term is, i.e. related to how many documents the term appears in.
>
> if term1 occurs more frequently in a document, i.e. tf is higher, you want
> to weight the document higher when you search for term1
>
> but if term1 is a very frequent term, i.e. it appears in lots of documents,
> then it's probably not as important to an overall search (where we have
> term1, term2 etc) so you want to downweight it (this is where idf comes in)
>
> then the normalisations like length normalisation (allow for 'fair' scoring
> across varied field length) come in too.
>
> the tf-idf scoring formula used by lucene is a scoring method that's been
> around a long long time... there are competing scoring metrics but that's an
> IR thing and not an argument you want to start on the lucene lists! :)
>
> these are IR ('information retrieval') concepts and you might want to start
> by going through some explanations of tf-idf scoring:
>
> http://en.wikipedia.org/wiki/Tf%E2%80%93idf
> http://wiki.apache.org/lucene-java/InformationRetrieval
>
>
> > 2) What is the formula to calculate this fieldNorm value?
>
> in terms of how lucene implements its tf-idf scoring - you can see here:
> http://lucene.apache.org/java/3_0_2/scoring.html
>
> also, the lucene in action book is a really good book if you are starting
> out with lucene (and will save you a lot of grief with understanding
> lucene / setting up your application!), it covers all the basics and then
> moves on to more advanced stuff and has lots of code examples too:
> http://www.manning.com/hatcher2/
>
> hope that helps,
>
> bec :)
