From: manjula wijewickrema
To: java-user@lucene.apache.org
Subject: Re: Why not normalization?
Date: Fri, 9 Jul 2010 16:00:45 +0530
In-Reply-To: <025701cb1f39$f55f46d0$e01dd470$@thetaphi.de>
References: <025701cb1f39$f55f46d0$e01dd470$@thetaphi.de>
Reply-To: java-user@lucene.apache.org
Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=e0cb4e887eb3548e5d048af1e521

--e0cb4e887eb3548e5d048af1e521
Content-Type: text/plain; charset=ISO-8859-1

Thanx

On Fri, Jul 9, 2010 at 1:10 PM, Uwe Schindler wrote:

> > Thanks for your valuable comments. Yes, I observed that once the number of
> > terms in the document goes up, the fieldNorm value goes down correspondingly.
> > I think, therefore, there won't be any fault due to the variation in the
> > total number of terms in the document. Am I right?
>
> With the current scoring model advanced statistics are not available. There
> are currently some approaches to add BM25 support to Lucene, for which the
> index format needs to be enhanced to contain more statistics (number of
> terms per document, avg number of terms per document,...).
>
> > On Thu, Jul 8, 2010 at 9:34 AM, Rebecca Watson wrote:
> >
> > > hi,
> > >
> > > > 1) Although Lucene uses tf to calculate scoring it seems to me that
> > > > term frequency has not been normalized. Even if I index several
> > > > documents, it does not normalize the tf value. Therefore, since the
> > > > total number of words in the indexed documents varies, can't there be
> > > > a fault in Lucene's scoring?
> > >
> > > tf = term frequency, i.e. the number of times the term appears in the
> > > document, while idf is inverse document frequency, a measure of
> > > how rare a term is, i.e. related to how many documents the term
> > > appears in.
> > >
> > > if term1 occurs more frequently in a document, i.e. tf is higher, you
> > > want to weight the document higher when you search for term1
> > >
> > > but if term1 is a very frequent term, i.e. in lots of documents, then
> > > it's probably not as important to an overall search (where we have
> > > term1, term2 etc) so you want to downweight it (this is where idf comes in)
> > >
> > > then the normalisations like length normalisation (allow for 'fair'
> > > scoring across varied field length) come in too.
> > >
> > > the tf-idf scoring formula used by lucene is a scoring method that's
> > > been around a long long time... there are competing scoring metrics
> > > but that's an IR thing and not an argument you want to start on the
> > > lucene lists! :)
> > >
> > > these are IR ('information retrieval') concepts and you might want to
> > > start by going through the tf-idf scoring / some explanations for
> > > this kind of scoring.
> > >
> > > http://en.wikipedia.org/wiki/Tf%E2%80%93idf
> > > http://wiki.apache.org/lucene-java/InformationRetrieval
> > >
> > >
> > > > 2) What is the formula to calculate this fieldNorm value?
> > >
> > > in terms of how lucene implements its tf-idf scoring - you can see here:
> > > http://lucene.apache.org/java/3_0_2/scoring.html
> > >
> > > also, the lucene in action book is a really good book if you are
> > > starting out with lucene (and will save you a lot of grief with
> > > understanding lucene / setting up your application!), it covers all
> > > the basics and then moves on to more advanced stuff and has lots of
> > > code examples too:
> > > http://www.manning.com/hatcher2/
> > >
> > > hope that helps,
> > >
> > > bec :)
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org

--e0cb4e887eb3548e5d048af1e521--
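
For readers following the thread, here is a minimal sketch of the tf, idf and
length-normalisation factors discussed above. It assumes the default formulas
described for DefaultSimilarity in the Lucene 3.0 scoring documentation linked
in the thread: tf is the square root of the raw frequency, idf is
1 + ln(numDocs / (docFreq + 1)), and lengthNorm is 1 / sqrt(numTerms). The
function names and the simplified_score helper are illustrative only and are
not Lucene API; the helper leaves out coord, queryNorm and all boosts. Lucene
also stores the norm in a single byte, so the fieldNorm values it reports are
coarser than what this sketch computes.

```python
import math


def tf(freq):
    # Term-frequency factor: square root of the raw count in the field.
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    # Inverse document frequency: rarer terms get a larger weight.
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def length_norm(num_terms):
    # Length normalisation: longer fields get a smaller norm.
    # Boosts are assumed to be 1.0 here (an assumption of this sketch).
    return 1.0 / math.sqrt(num_terms)


def simplified_score(freq, doc_freq, num_docs, num_terms):
    # Hypothetical single-term score. idf is squared because it enters
    # both the query weight and the document weight in the tf-idf formula.
    # coord, queryNorm and boosts are deliberately omitted.
    return tf(freq) * idf(doc_freq, num_docs) ** 2 * length_norm(num_terms)


if __name__ == "__main__":
    # The norm shrinks as the field gets longer.
    for n in (4, 25, 100):
        print(n, length_norm(n))
```

This matches the observation earlier in the thread: as the number of terms in
a field goes up, the norm goes down, which is how field length is compensated
for even though tf itself is not divided by the document length.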