Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-dev@lucene.apache.org
Received-SPF: neutral (asf.osuosl.org: local policy)
Date: Tue, 23 May 2006 23:38:17 -0700 (PDT)
From: Chris Hostetter <hossman_lucene@fucit.org>
To: java-dev@lucene.apache.org
Subject: Re: SweetSpotSimiliarity
In-Reply-To: <F9F270C4-FA1E-460F-A54F-E2E56AAD0286@rectangular.com>
Message-ID: <Pine.LNX.4.58.0605232225480.5337@hal.rescomp.berkeley.edu>
References: <9807227.1148432130297.JavaMail.jira@brutus>
 <F9F270C4-FA1E-460F-A54F-E2E56AAD0286@rectangular.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII


: Presumably you had this in the can, and didn't just implement it
: today. :)  For those of you who didn't see this afternoon's thread

correct, I've been using it for a few months, and meant to contribute it
last week .. but i forgot until today's discussion of customizing per
field.

: originally, though I didn't see that thread.  Earlier discussion at
: http://xrl.us/mpkp (Link to mail-archives.apache.org).  Mark's nifty
: graph is still up (linked from his email).

Wow ... i was around for that thread, and looking at it now, i remember
the graph -- but at the time i was on sabbatical and hadn't even started
thinking about score issues (i was only worried about Filters and BitSet
intersections).  Had i remembered that thread when i started looking at how
Similarity worked ~Nov2005 I would have saved myself a lot of time and
headaches.

: "stub" documents Lucene tends to favor.  However, it also stopped
: excellent matches in fields which are supposed to be short -- like
: title -- from getting a good solid lift.

: The only answer seems to be to apply different lengthNorm algos to
: different fields.

or just use the same formula, but with different constants... at which
point they aren't constants, but you know what i mean.

: What uses have you found for a plateau lengthNorm, Hoss?

Primarily bad data: I want fields that are not too short, not too
long ... just right.

If i get data from a source i can't trust i want to make sure fields that
are typically short are rewarded for being short, but penalized for being
trivial (one word RSS titles from the thread you mentioned are a perfect
example of what i mean)
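
A minimal sketch of that "not too short, not too long" plateau, for
illustration only.  The class name, the parameter names (min, max,
steepness) and the exact formula are assumptions chosen to show the shape;
they are not quoted from the contributed patch.

```java
class PlateauLengthNorm {
    /**
     * Returns 1.0 for any field length in [min, max] and a smaller value
     * the further the length falls outside that range.
     * Hypothetical helper: names and formula are illustrative assumptions.
     */
    static float lengthNorm(int numTerms, int min, int max, float steepness) {
        // |x - min| + |x - max| equals (max - min) anywhere on the plateau,
        // so the distance is 0 there and grows linearly outside it.
        int distance = Math.abs(numTerms - min) + Math.abs(numTerms - max) - (max - min);
        return (float) (1.0 / Math.sqrt(steepness * distance + 1.0));
    }

    public static void main(String[] args) {
        // A title field that is "just right" between 3 and 10 terms.
        System.out.println(lengthNorm(1, 3, 10, 0.5f));   // trivial one-word title: below 1.0
        System.out.println(lengthNorm(5, 3, 10, 0.5f));   // on the plateau: 1.0
        System.out.println(lengthNorm(40, 3, 10, 0.5f));  // far too long: well below 1.0
    }
}
```

With per-field values for min and max, the same formula can reward a short
title and still penalize a one-word one.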

: > 2) a baseline tf that provides a fixed value for tf's up to a
: > minimum, at which point it becomes a sqrt curve (this is used by
: > the tf(int) function).
: > 3) a hyperbolic tf function which is best explained by graphing the
: > equation.  this isn't used by default, but is available for
: > subclasses to call from their own tf functions.
:
: ... and when do you use these custom tf's?

honestly, i don't remember if i even use the baselineTf anymore (I think i
outgrew it) but it's a simple step up from the default that comes
in handy when 2 isn't really much better than 3 ... without
requiring you to buy in to the crazy hyperbolic tf thing that i came up
with on a whim and discovered it worked pretty well for me.
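
A rough sketch of the baseline tf as described in the quoted summary: flat
up to a minimum frequency, then a sqrt curve.  The parameter names (min,
base) and the way the two pieces are joined are assumptions made for
illustration, not a copy of the patch.

```java
class BaselineTf {
    /**
     * Fixed value `base` for frequencies up to `min`, then a sqrt curve.
     * Hypothetical helper: the shift inside the sqrt is an assumption,
     * chosen so the curve is continuous at freq == min.
     */
    static float baselineTf(float freq, float min, float base) {
        if (freq <= min) {
            return base;
        }
        // At freq == min this evaluates to sqrt(base * base) == base.
        return (float) Math.sqrt(freq + (base * base) - min);
    }

    public static void main(String[] args) {
        // With min = 3, one, two or three occurrences all score the same.
        for (int freq = 1; freq <= 8; freq++) {
            System.out.println(freq + " -> " + baselineTf(freq, 3f, 1.5f));
        }
    }
}
```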

the hyperbolic tf has the nice property of giving small increases as the
frequency increases a small amount, then increasing faster once you reach
the point where you think small increases are significant, and then growing
more slowly again once you are above the point where you think more
occurrences are actually significant.

: I tried to graph the hyperbolic function (tip for OS X users: check
: out Grapher.app, in Utilities).  It looks like by default, everything
: cancels out and it returns a constant 2.  But it's pretty complicated, so
: maybe I missed something.

Hmm... maybe i screwed up the defaults at some point ... i use other
values myself, gnuplot shows the defaults returning ~2 for all values
greater than 15, and ~0 for all values less than 5 and a gradient from 0
to 2 between 5 and 15.  "e" probably isn't the best base.
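
For anyone who wants to check that shape without a grapher, here is a small
sketch of an S-shaped tf that reproduces the numbers above.  The parameter
names and the use of tanh are assumptions: (b^x - b^-x) / (b^x + b^-x) is
the same as tanh(x * ln b), and the midpoint of 10 is only inferred from the
5-to-15 gradient, not taken from the patch.

```java
class HyperbolicTf {
    /**
     * S-shaped tf: close to `min` for low frequencies, close to `max` for
     * high ones, and exactly halfway between them at freq == xOffset.
     * Hypothetical helper: names and formula are illustrative assumptions.
     */
    static float hyperbolicTf(float freq, float min, float max, double base, float xOffset) {
        double x = freq - xOffset;
        // tanh(x * ln(base)) ranges over (-1, 1); a smaller base gives a
        // gentler slope around xOffset.
        double s = Math.tanh(x * Math.log(base));
        return (float) (min + (max - min) / 2.0 * (s + 1.0));
    }

    public static void main(String[] args) {
        // Assumed defaults: min 0, max 2, base e, midpoint 10.
        for (int freq = 1; freq <= 20; freq++) {
            System.out.println(freq + " -> " + hyperbolicTf(freq, 0f, 2f, Math.E, 10f));
        }
    }
}
```

With base e nearly all of the change happens between 5 and 15, which is why
a base closer to 1 may be a better choice.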

: design though no formal IR training.   Today, he wrote, "The title is
: not a discussion.  It's binary; this is being considered or it
: isn't.  The more words that are being considered, the less
: significant any one is, but you can't get more considered by being
: mentioned more than once in the title."

Very well put.

: I think I would implement this by having tf always return 1 for the
: title field.

Alas ... tf() doesn't take in a field name, so to do this, you'd have to
override the Similarity each time you construct a query object,
something like this i believe...

   // per-query override: wrap whatever Similarity the query would
   // normally use and replace only tf()
   Query q = new TermQuery(t) {
      public Similarity getSimilarity(Searcher s) {
         return new SimilarityDelegator(super.getSimilarity(s)) {
            public float tf(float freq) {
               ...
            }
         };
      }
   };

...but good lord if that isn't a pain.


-Hoss

