From: Marvin Humphrey
To: java-dev@lucene.apache.org
Subject: SweetSpotSimiliarity
Date: Tue, 23 May 2006 21:55:19 -0700
In-Reply-To: <9807227.1148432130297.JavaMail.jira@brutus>

[note: subject header changed from "Re: [jira] Updated: (LUCENE-577)
SweetSpotSimiliarity"]

Thought-provoking stuff, Hoss...

On May 23, 2006, at 5:55 PM, Hoss Man (JIRA) wrote:

> This is a new Similarity implimention for the contrib/miscellaneous/
> package, it provides a Similiarty designed for people who know the
> "sweetspot" of their data. three major pieces of functionality are
> included:
> 1) a lengthNorm which creates a "plateau" of values.

Presumably you had this in the can, and didn't just implement it
today. :)

For those of you who didn't see this afternoon's thread "Per-Field
Analyzer" on java-user, KinoSearch has used a plateau lengthNorm since
version 0.06...

    1 / sqrt(max(100, numTerms))

... and it's been a mixed bag. The suggestion came my way via Mark
Bennett, apparently from Doug originally, though I didn't see that
thread. Earlier discussion at http://xrl.us/mpkp (Link to
mail-archives.apache.org). Mark's nifty graph is still up (linked from
his email).

Making that algo the default achieved my goal: downgrading the type of
"stub" documents Lucene tends to favor. However, it also stopped
excellent matches in fields which are supposed to be short -- like
title -- from getting a good solid lift. The only answer seems to be
to apply different lengthNorm algos to different fields.

What uses have you found for a plateau lengthNorm, Hoss?

> 2) a baseline tf that provides a fixed value for tf's up to a
> minimum, at which point it becomes a sqrt curve (this is used by
> the tf(int) function.
> 3) a hyperbolic tf function which is best explained by graphing the
> equation. this isn't used by default, but is available for
> subclasses to call from their own tf functions.

... and when do you use these custom tf's?

I tried to graph the hyperbolic function (tip for OS X users: check
out Grapher.app, in Utilities). It looks like by default, everything
cancels out and it returns a constant 2. But it's pretty complicated,
so maybe I missed something.
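To make the trade-off concrete, here is a minimal sketch of the plateau lengthNorm next to the classic 1 / sqrt(numTerms) norm, with a per-field switch. It is plain Java and deliberately not tied to the Lucene or KinoSearch APIs; the class name, the method names, the field name "title", and the guard against zero-length fields are all assumptions made for illustration. Only the formula 1 / sqrt(max(100, numTerms)) comes from the text above.

```java
// Sketch only: plain Java, not the Lucene Similarity or KinoSearch API.
// All names here are hypothetical.
final class PlateauLengthNorm {
    // Plateau width of 100 terms, as in the formula quoted in the email.
    static final int PLATEAU = 100;

    // Classic norm: 1 / sqrt(numTerms). Shorter fields always score higher.
    // The max(1, ...) guard against empty fields is an added assumption.
    static float classic(int numTerms) {
        return (float) (1.0 / Math.sqrt(Math.max(1, numTerms)));
    }

    // Plateau norm: 1 / sqrt(max(plateau, numTerms)).
    // Every field at or below the plateau length gets the same norm,
    // so very short "stub" documents no longer get an extra lift.
    static float plateau(int numTerms, int plateau) {
        return (float) (1.0 / Math.sqrt(Math.max(plateau, numTerms)));
    }

    // Per-field dispatch: fields that are supposed to be short keep the
    // classic norm; everything else gets the plateau. The choice of
    // "title" as the short field is an assumption.
    static float lengthNorm(String field, int numTerms) {
        if ("title".equals(field)) {
            return classic(numTerms);
        }
        return plateau(numTerms, PLATEAU);
    }
}
```

With this split, a 4-term title still gets a norm of 0.5, while a 4-term body field is held to 0.1, the same as a 100-term body.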

My interest in this is being driven by a really savvy client with a
formal mathematics background and a good feel for search engine
design, though no formal IR training. Today, he wrote, "The title is
not a discussion. It's binary; this is being considered or it isn't.
The more words that are being considered, the less significant any one
is, but you can't get more considered by being mentioned more than
once in the title."

I think I would implement this by having tf always return 1 for the
title field.

Thought: It would be really handy if we had a benchmarking test for IR
precision.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
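
A rough sketch of the "binary title" idea, assuming a tf function that is told which field it is scoring. That field-aware signature is hypothetical and is not the Lucene Similarity API; the class name and the field name "title" are likewise assumptions. Non-title fields fall back to sqrt(freq), the usual default tf curve.

```java
// Sketch only: a field-aware tf helper with hypothetical names.
final class FieldAwareTf {
    // Returns the term-frequency factor for a term occurring `freq`
    // times in `field`.
    static float tf(String field, int freq) {
        if (freq <= 0) {
            return 0f;               // term absent: no contribution
        }
        if ("title".equals(field)) {
            return 1f;               // binary: present or not, repeats ignored
        }
        return (float) Math.sqrt(freq); // default-style sqrt curve elsewhere
    }
}
```

Under this scheme a term mentioned three times in the title scores the same as one mentioned once, while lengthNorm still makes each word in a long title count for less.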