Date: Fri, 08 Jul 2005 08:19:58 -0400
From: DM Smith
To: java-dev@lucene.apache.org
Subject: Re: Proposal for change to DefaultSimilarity's lengthNorm to fix "short document" problem

At crosswire.org we are using Lucene to index Bibles, with each Bible
having its own index and each verse in the Bible being a document in the
index. So each document is short. Length depends upon the language of
translation, but the lengths range from 2 to less than 100.

In our case the existing bias seems appropriate, and it does not appear to
break down for extremely short documents.

I would suggest that if the bias is changed, it be based upon the length
and distribution of documents in the index, or be driven by
programmer-supplied parameters.

Mark Bennett wrote:

>Our client, Rojo, is considering overriding the default implementation of
>lengthNorm to fix the bias towards extremely short RSS documents.
>
>The general idea put forth by Doug was that longer documents tend to have
>more instances of matching words simply because they are longer, whereas
>shorter documents tend to be more precise and should therefore be considered
>more authoritative.
>
>While we generally agree with this idea, it seems to break down for
>extremely short documents. For example, one- and two-word documents tend to
>be test messages, error messages, or simple answers with no accompanying
>context.
>
>I've seen discussions of this before from Doug, Chuck, Kevin and Sanji;
>likely others have posted as well. We'd like to get your feedback on our
>current idea for a new implementation, and perhaps eventually see about
>getting the default Lucene formula changed.
>
>Pictures speak louder than words. I've attached a graph of what I'm about
>to talk about, and if the attachment is not visible, I've also posted it
>online at:
>http://ideaeng.com/customers/rojo/lucene-doclength-normalization.gif
>
>Looking at the graph, the default Lucene implementation is represented by
>the dashed dark-purple line. As you can see, it's giving the highest scores
>to documents with fewer than 5 words, with the max score going to
>single-word documents. Doug's quick fix for clipping the score for documents
>with fewer than 100 terms is shown in light purple.
>
>Rojo's idea was to target documents of a particular length (we've chosen 50
>for this graph), and then have a smooth curve that slopes away from there
>for larger and smaller documents. The red, green and blue curves are some
>experiments I did trying to stretch out the standard "bell curve" (see
>http://en.wikipedia.org/wiki/Normal_distribution).
>
>The "flat" and "stretch" factors are specific to my formula. I've tried
>playing around with how gradually the curve slopes away for smaller and
>larger documents; for example, the red curve really "punishes" documents
>with fewer than 5 words.
>
>We'd really appreciate your feedback on this, as we do plan to do
>"something". After figuring out what the curve "should be", the next items
>on our end are implementation and fixing our existing indices, which I'll
>save for a later post.
>
>Thanks in advance for your feedback,
>Mark Bennett
>mbennett@ideaeng.com
>(on behalf of rojo.com)
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>For additional commands, e-mail: java-dev-help@lucene.apache.org
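
To make the three shapes under discussion concrete, here is a minimal
sketch in plain Java with no Lucene dependency. It rests on several
assumptions: that the default norm has the 1/sqrt(numTerms) form, that the
"clipping" fix means treating any document shorter than 100 terms as if it
had 100, and that a Gaussian over the log of the length ratio is an
acceptable stand-in for a bell-shaped curve. The class, the method names and
the "width" parameter are all hypothetical, and bellNorm is not Mark's
formula; his "flat" and "stretch" factors are not spelled out in the post.

```java
// Sketch only. None of these names belong to the Lucene API.
final class LengthNormSketch {

    // Assumed default shape: the shorter the document, the higher the norm,
    // with the maximum at a single term. Expects numTerms >= 1.
    static double defaultNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // One reading of the clipping fix: documents shorter than minTerms are
    // scored as if they had exactly minTerms terms.
    static double clippedNorm(int numTerms, int minTerms) {
        return 1.0 / Math.sqrt(Math.max(numTerms, minTerms));
    }

    // Hypothetical bell-shaped curve that peaks at targetTerms and falls off
    // for both shorter and longer documents. "width" controls how gradual
    // the fall-off is. Expects numTerms >= 1 and targetTerms >= 1.
    static double bellNorm(int numTerms, int targetTerms, double width) {
        double d = Math.log((double) numTerms / targetTerms);
        return Math.exp(-(d * d) / (2.0 * width * width));
    }

    public static void main(String[] args) {
        int[] lengths = {1, 2, 5, 50, 100, 1000};
        for (int n : lengths) {
            System.out.printf("%5d  default=%.3f  clipped=%.3f  bell=%.3f%n",
                    n, defaultNorm(n), clippedNorm(n, 100), bellNorm(n, 50, 1.5));
        }
    }
}
```

With clipping at 100, every document of up to 100 terms gets the same norm,
so an index like ours, where each verse is well under that length, would
lose length normalization entirely. A bell-shaped curve instead ranks very
short documents below those near the target length, which is why I would
want the target and width to come from the index or from the programmer.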