lucene-java-user mailing list archives

From Paul Allan Hill
Subject RE: SweetSpotSimilarity
Date Fri, 17 Feb 2012 19:41:38 GMT
> -----Original Message-----
> From: Chris Hostetter
> As for what hyperbolicTf is trying to do ... it creates a hyperbolic function letting
> you specify a hard max no matter how many terms there are.

A picture -- or more precisely a graph -- would be worth a thousand words.  As it says in issue
577, "a hyperbolic tf function which is best explained by graphing the equation".  That's great,
but I couldn't find "Mark [Bennet's] nifty graph [...] (linked from his email)."  Can anyone
provide any help locating what sounds like a useful resource?

The JavaDoc (which Chris probably also wrote way back when) says hyperbolic TANGENT function
( ).  At least that clarifies the basic shape, even if I
(and apparently others judging from the yearly questions on the Lucene list) have yet to work
out the full impact of all the parameters and how hyperbolic tangent might compare to the
sqrt( freq + C ) of the baseline, which I believe, if used with the defaults, degenerates
to the DefaultSimilarity formula.
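
Lacking the graph, one way to get a feel for the two shapes is to tabulate them. What follows is
only a sketch in Python, not Lucene's code: the formulas are transcribed from what the
SweetSpotSimilarity JavaDoc appears to describe, the function and argument names are made up, and
the hyperbolic min of 0 and max of 2 are assumed defaults (only the base of 1.3 and the offset of
10 come up in this thread).

```python
import math

def hyperbolic_tf(freq, lo=0.0, hi=2.0, base=1.3, xoffset=10.0):
    """Assumed hyperbolic tf: lo + (hi - lo)/2 * (tanh-like term + 1)."""
    if freq == 0:
        return 0.0
    x = freq - xoffset
    up, down = base ** x, base ** -x
    return lo + (hi - lo) / 2.0 * ((up - down) / (up + down) + 1.0)

def baseline_tf(freq, tf_base=0.0, tf_min=0.0):
    """Assumed baseline tf: flat at tf_base up to tf_min, then sqrt growth."""
    if freq == 0:
        return 0.0
    if freq <= tf_min:
        return tf_base
    return math.sqrt(freq + tf_base * tf_base - tf_min)

if __name__ == "__main__":
    # Print both curves side by side for a handful of term counts.
    print(f"{'freq':>5} {'hyperbolic':>11} {'baseline':>9}")
    for freq in (1, 2, 5, 10, 15, 20, 50, 100):
        print(f"{freq:>5} {hyperbolic_tf(freq):>11.4f} {baseline_tf(freq):>9.4f}")
```

With tf_base and tf_min both 0 the baseline sketch is just sqrt(freq), which keeps growing, while
the hyperbolic sketch levels off at its max.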

Another problem mentioned in the e-mail thread Chris linked is "people who know the 'sweetspot'
of their data", but I have yet to find a definition of what is meant by "sweetspot", so I
couldn't say whether I know my data's sweet spot or not.

Another question is how the tf_hyper_offset parameter might be considered.  It appears to
be the inflexion point of the tanh equation, but what term count might a caller consider centering
there (or consider to be the approximate area where the graph is "mostly" level)?  Or more simply,
why 10?
Any thoughts from anyone?
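
The offset question can at least be explored numerically. Assuming the hyperbolic formula as the
JavaDoc seems to give it (min 0 and max 2 are assumed defaults, and the helper names are
hypothetical), the value at freq == offset is always halfway between min and max. The sketch below
looks for the term counts at which the curve has covered 5% and 95% of its range.

```python
def hyperbolic_tf(freq, lo=0.0, hi=2.0, base=1.3, xoffset=10.0):
    """Assumed hyperbolic tf: lo + (hi - lo)/2 * (tanh-like term + 1)."""
    if freq == 0:
        return 0.0
    x = freq - xoffset
    up, down = base ** x, base ** -x
    return lo + (hi - lo) / 2.0 * ((up - down) / (up + down) + 1.0)

def first_freq_reaching(fraction, lo=0.0, hi=2.0, base=1.3, xoffset=10.0, limit=200):
    """Smallest integer freq whose tf covers `fraction` of the lo..hi range.

    Returns None if no freq up to `limit` gets there; the limit is kept small
    so that base ** x stays within float range for moderate bases.
    """
    for freq in range(1, limit + 1):
        tf = hyperbolic_tf(freq, lo, hi, base, xoffset)
        if (tf - lo) / (hi - lo) >= fraction:
            return freq
    return None

if __name__ == "__main__":
    print("midpoint at the offset:", hyperbolic_tf(10))
    print("5% of range reached at freq:", first_freq_reaching(0.05))
    print("95% of range reached at freq:", first_freq_reaching(0.95))
```

Under those assumptions the curve does most of its climbing between roughly 5 and 16 occurrences,
so an offset of 10 treats about 10 occurrences as the middle of the range worth distinguishing.
Whether that suits a given collection is exactly the open question.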

I also note that the JavaDoc says that the default tf_hyper_base ("the base value to be used
in the exponential for the hyperbolic function") value is e. But checking the code, the default
is actually 1.3 (less than half e).  Should I file a doc bug?
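
Whichever value is intended, the base only rescales the x axis: for any base b > 0,
(b^x - b^-x) / (b^x + b^-x) equals tanh(x * ln b). Base e therefore gives a plain tanh, and base
1.3 gives a tanh stretched horizontally by 1 / ln(1.3), about 3.8. A quick numeric check of that
identity, again only a sketch (the helper name is hypothetical and the b^x form is assumed from
the JavaDoc):

```python
import math

def tanh_with_base(x, base):
    """Tanh-like ratio built from powers of an arbitrary base."""
    up, down = base ** x, base ** -x
    return (up - down) / (up + down)

if __name__ == "__main__":
    # Compare the power form against math.tanh with a rescaled argument.
    for base in (1.3, math.e):
        for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
            direct = tanh_with_base(x, base)
            rescaled = math.tanh(x * math.log(base))
            assert math.isclose(direct, rescaled, rel_tol=1e-9, abs_tol=1e-12)
    print("horizontal stretch for base 1.3:", 1.0 / math.log(1.3))
```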

To summarize: Does anyone have any resources along the lines of graphs of these (or any other)
tf functions, general discussion of document collection sweet spots, and any insight into the
parameters of this class (hyperbolic tangent or otherwise)?


> : > And I am aware that SweetSpotSimilarity resulted from this paper
> : >
> : >
> For the record, that paper did not result in SSS -- I wrote SSS ~Dec 2005 and contributed it to
> Apache a few months later on behalf of CNET Networks where i developed it to solve some specific
> problems we had with product data...
> (and subsequent replies)
> ...Doron wrote the paper later, although you'll note lots of discussions around that time on the
> mailing list about customizing Similarity based on domain specific data -- the concepts certainly
> weren't novel.
> : > However, I was wondering if there was a resource that explained (and gave examples) of how SSS
> : > works and what each parameter (hyperbolic, etc) means. I know this is a Lucene list but I am
> : > actually
> The functions are pretty clearly spelled out in the javadocs -- you just set the options on the
> class to control the constant values of the functions.  The easiest way to understand them is
> probably to use something like gnuplot to graph them using various values for the constants, and
> then compare to graphs of the corresponding functions from DefaultSimilarity.
> -Hoss