Date: Thu, 1 Dec 2011 09:35:41 -0800
From: Marvin Humphrey
To: lucy-user@incubator.apache.org
Message-ID: <20111201173541.GA10932@rectangular.com>
Subject: Re: [lucy-user] $boost importance in weighting

On Thu, Dec 01, 2011 at 12:07:47PM +0200, goran kent wrote:
> The page at http://incubator.apache.org/lucy/docs/perl/Lucy/Plan/FieldType.html
> is a bit sparse on detail about the boost property.
> I'd like to get a better understanding of how and by how much its
> value influences score (rank) in search results - what's the formula
> used when boost is applied to a document's score?

It's pretty complicated.  Field boost, document boost, and field length
normalization are all consolidated, then they are reduced down to a single
8-bit float with a 3-bit mantissa and a 5-bit exponent.

Because of the coarseness of the lossy data compression, small changes to
boost may not even move the needle.  I wouldn't bother with a field or
document boost multiplier that doesn't change things by at least a factor
of 2.

It's theoretically possible to calculate ceiling and floor values for boost,
but I don't know what the answers are.

> Finally, what are reasonable values (upper/lower) for boost when, in
> my case, e.g., I'd like to influence the score based on an external value
> (page rank), but not have my page rank completely skew the scores -
> just enough to promote pages which have an organic page rank value
> which should be considered to some degree (a very broad subject, I
> know).

Subtle rerankings are problematic because search engines are noisy.  Even the
best ones give you a bunch of junk you don't need.  We don't really care about
fine distinctions, because if you sample a handful of documents with identical
scores, odds are that they are *wildly* divergent in terms of what the user
wants.  We only care about big differences.

> My tests so far show that a boost value with a small variance in the
> mantissa has an almost zero influence on score/ranking.  My thinking
> is to boost with something akin to $boost+=LogN(PR) - i.e. between 0-10
> (log scale).  So this boils down to: is using a scale of 1-10 a good
> idea w.r.t. the Lucy boost property to influence ranking, or 10x that
> value?

I'd try 1-100.  If that's too much, scale it back.

Marvin Humphrey
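
To make the coarseness of a 3-bit mantissa concrete, here is a minimal Python
sketch of a quantizer with the layout described above (3 mantissa bits, 5
exponent bits).  It is not Lucy's actual encoder: the function name, the
exponent bias of 15, and the choice to truncate rather than round are all
assumptions made for illustration.

```python
import math

MANTISSA_BITS = 3    # from the message: 3-bit mantissa
EXPONENT_BITS = 5    # from the message: 5-bit exponent
EXPONENT_BIAS = 15   # ASSUMPTION: bias chosen for illustration only


def quantize(value: float) -> float:
    """Truncate a finite, non-negative float to a 3-bit mantissa.

    Hypothetical helper, not part of Lucy.  Returns the largest
    representable value that is <= the input, clamped to the range
    a 5-bit exponent can express under the assumed bias.
    """
    if value <= 0.0:
        return 0.0

    # frexp gives value = frac * 2**exp with 0.5 <= frac < 1.
    frac, exp = math.frexp(value)
    significand = frac * 2.0      # normalized to [1, 2)
    exponent = exp - 1

    # Keep only MANTISSA_BITS fractional bits of the significand.
    steps = 1 << MANTISSA_BITS
    significand = math.floor(significand * steps) / steps

    min_exp = -EXPONENT_BIAS
    max_exp = (1 << EXPONENT_BITS) - 1 - EXPONENT_BIAS
    if exponent < min_exp:
        return 0.0                # underflow: too small to represent
    if exponent > max_exp:
        exponent = max_exp        # overflow: clamp to the largest value
        significand = 2.0 - 1.0 / steps

    return math.ldexp(significand, exponent)


if __name__ == "__main__":
    for boost in (1.0, 1.05, 1.1, 1.2, 1.5, 2.0, 10.0, 100.0):
        print(f"{boost:>6} -> {quantize(boost)}")
```

Under these assumptions, only eight distinct values exist between any power of
two and the next, so a multiplier such as 1.05 or 1.1 can land in the same
bucket as 1.0 and disappear entirely.  In the real engine the boost is also
combined with length normalization before being compressed, which is part of
why only large multipliers are reliably visible.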