From: Chris Hostetter
To: java-dev@lucene.apache.org
Date: Tue, 23 May 2006 23:38:17 -0700 (PDT)
Subject: Re: SweetSpotSimiliarity
References: <9807227.1148432130297.JavaMail.jira@brutus>

: Presumably you had this in the can, and didn't just implement it
: today.
: :) For those of you who didn't see this afternoon's thread

Correct, I've been using it for a few months, and meant to contribute it
last week ... but I forgot until today's discussion of customizing per
field.

: originally, though I didn't see that thread.  Earlier discussion at
: http://xrl.us/mpkp (Link to mail-archives.apache.org).  Mark's nifty
: graph is still up (linked from his email).

Wow ... I was around for that thread, and looking at it now, I remember
the graph, but at the time I was on sabbatical and hadn't even started
thinking about score issues (I was only worried about Filters and BitSet
intersections).  Had I remembered that thread when I started looking at
how Similarity worked ~Nov2005 I would have saved myself a lot of time
and headaches.

: "stub" documents Lucene tends to favor.  However, it also stopped
: excellent matches in fields which are supposed to be short -- like
: title -- from getting a good solid lift.
:
: The only answer seems to be to apply different lengthNorm algos to
: different fields.

Or just use the same formula, but with different constants ... at which
point they aren't constants, but you know what I mean.

: What uses have you found a plateau lengthNorm, Hoss?

Primarily bad data: I want fields that are not too short, not too long
... just right.  If I get data from a source I can't trust, I want to
make sure fields that are typically short are rewarded for being short,
but penalized for being trivial (the one-word RSS titles from the thread
you mentioned are a perfect example of what I mean).

: > 2) a baseline tf that provides a fixed value for tf's up to a
: >    minimum, at which point it becomes a sqrt curve (this is used by
: >    the tf(int) function.
: > 3) a hyperbolic tf function which is best explained by graphing the
: >    equation.  this isn't used by default, but is available for
: >    subclasses to call from their own tf functions.
:
: ... and when do you use these custom tf's?

Honestly, I don't remember if I even use the baselineTf anymore (I think
I outgrew it), but it's a simple step up from the default that comes in
handy when 2 isn't really much better than 3 ... without requiring you
to buy in to the crazy hyperbolic tf thing that I came up with on a whim
and discovered worked pretty well for me.  The hyperbolic tf has the
nice property of giving small increases as the frequency increases a
small amount, then increasing faster once you reach the point where you
think small increases are significant, and then growing slower again
once you are above the point where you think more occurrences are
actually significant.

: I tried to graph the hyperbolic function (tip for OS X users: check
: out Grapher.app, in Utilities).  It looks like by default, everything
: cancels out it returns a constant 2.  But it's pretty complicated, so
: maybe I missed something.

Hmm ... maybe I screwed up the defaults at some point ... I use other
values myself.  gnuplot shows the defaults returning ~2 for all values
greater than 15, ~0 for all values less than 5, and a gradient from 0 to
2 between 5 and 15.  "e" probably isn't the best base.

: design though no formal IR training.  Today, he wrote, "The title is
: not a discussion.  It's binary; this is being considered or it
: isn't.  The more words that are being considered, the less
: significant any one is, but you can't get more considered by being
: mentioned more than once in the title."

Very well put.

: I think I would implement this by having tf always return 1 for the
: title field.

Alas ... tf() doesn't take in a field name.  To do this, you'd have to
override the Similarity each time you construct a query object,
something like this I believe...

  Query q = new TermQuery(t) {
    public Similarity getSimilarity(Searcher s) {
      // grab the default Similarity here; "super" can't be reached
      // from inside the nested anonymous class
      final Similarity base = super.getSimilarity(s);
      return new SimilarityDelegator(base) {
        public float tf(float freq) { ... }
      };
    }
  };

...but good lord if that isn't a pain.

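For anyone who would rather run the two tf shapes discussed above than graph them, here is a minimal standalone sketch. It is not the SweetSpotSimilarity source: the class name, method signatures, and parameter names (min, base, lo, hi, xoffset) are made up for illustration, the shift under the square root is an assumption chosen so the flat part and the sqrt curve meet at min, and the "hyperbolic" curve is assumed to be a tanh with an adjustable base, which matches the slow/fast/slow behaviour described but may differ from the real equation.

```java
// Illustrative only: hypothetical names, not the SweetSpotSimilarity API.
final class TfCurves {

    /**
     * Flat at "base" for frequencies up to "min", then a sqrt curve.
     * Assumption: the term (base*base - min) under the root is picked so
     * the two pieces join without a jump at freq == min.
     */
    static float baselineTf(float freq, float min, float base) {
        if (freq <= 0.0f) return 0.0f;   // no occurrences, no score
        if (freq <= min) return base;    // the plateau
        return (float) Math.sqrt(freq + base * base - min);
    }

    /**
     * S-shaped curve running from "lo" to "hi", centred on "xoffset".
     * tanh(ln(b) * x) is the same as (b^x - b^-x) / (b^x + b^-x), so a
     * smaller base stretches the gradient over a wider frequency range.
     */
    static float hyperbolicTf(float freq, float lo, float hi,
                              double base, float xoffset) {
        double t = Math.tanh(Math.log(base) * (freq - xoffset));
        return (float) (lo + (hi - lo) / 2.0 * (t + 1.0));
    }

    public static void main(String[] args) {
        // Sample values only; min=3, base=1.5, xoffset=10 are arbitrary.
        for (int f : new int[] {1, 2, 3, 5, 10, 15, 20}) {
            System.out.printf("freq=%2d baseline=%.3f hyperbolic=%.3f%n",
                f,
                baselineTf(f, 3f, 1.5f),
                hyperbolicTf(f, 0f, 2f, Math.E, 10f));
        }
    }
}
```

With base e the hyperbolic curve is already close to its floor or ceiling a few steps either side of xoffset, which is one way to read the remark that "e" probably isn't the best base; passing a base nearer 1 spreads the gradient out.
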
-Hoss