lucene-java-user mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject RE: SweetSpotSimilarity
Date Tue, 28 Feb 2012 23:14:40 GMT

: A picture -- or more precisely a graph -- would be worth a 1000 words.  

fair enough.  I think the reason i never committed one initially was 
because the formula in the javadocs was trivial to plot in gnuplot...

gnuplot> min=0
gnuplot> max=2
gnuplot> base=1.3
gnuplot> xoffset=10
gnuplot> set yrange [0:3]
gnuplot> set xrange [0:20]
gnuplot> tf(x)=min+(max-min)/2*(((base**(x-xoffset)-base**-(x-xoffset))/(base**(x-xoffset)+base**-(x-xoffset)))+1)
gnuplot> plot tf(x)
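
For anyone without gnuplot handy, the same curve can be evaluated in Python. This is only a sketch, not Lucene code: the function names and keyword arguments are made up for illustration, and the defaults simply mirror the gnuplot session above. The second function checks that the expression is a rescaled tanh, since (b**d - b**-d)/(b**d + b**-d) equals tanh(d*ln(b)).

```python
import math

def hyperbolic_tf(x, tf_min=0.0, tf_max=2.0, base=1.3, xoffset=10.0):
    """Same expression as the gnuplot tf(x) above (names are illustrative)."""
    d = x - xoffset
    ratio = (base**d - base**-d) / (base**d + base**-d)
    return tf_min + (tf_max - tf_min) / 2.0 * (ratio + 1.0)

def hyperbolic_tf_tanh(x, tf_min=0.0, tf_max=2.0, base=1.3, xoffset=10.0):
    """Equivalent form: the ratio above is tanh(ln(base) * (x - xoffset))."""
    scaled = math.log(base) * (x - xoffset)
    return tf_min + (tf_max - tf_min) / 2.0 * (math.tanh(scaled) + 1.0)

if __name__ == "__main__":
    # Sample the curve over the same x range the gnuplot session plots.
    for x in range(0, 21, 5):
        print(x, round(hyperbolic_tf(x), 4))
```

One gnuplot detail to watch if you change the settings: with integer min and max, (max-min)/2 is integer division, so writing 2.0 avoids truncation when the difference is odd.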

i'll try to get some graphs committed and linked to from the javadocs that 
make it more clear how tweaking the settings affects the formula

: Another problem mentioned in the e-mail thread Chris linked is "people 
: who know the 'sweetspot' of their data.", but I have yet to find a 
: definition of what is meant by "sweetspot", so I couldn't say whether I 
: know my data's sweet spot or not.

hmmm... sorry, i kind of just always took it as self-evident.  i'm not even 
sure how to define it ... the sweetspot is "the sweetspot" ... the range 
of good values such that things not in the sweetspot are atypical and 
"less good"

To give a practical example: when i was working with product data we found 
that the sweetspot for the length of a product name was between 4 and 10 
terms.  products with fewer than 4 terms in the name field were usually 
junk products (eg: "ram" or "mouse") and products with more than 10 terms 
in the name were usually junk products that had keyword stuffing going on.
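
To make the field-length example concrete, here is a rough sketch of a plateau-shaped length norm. The formula, 1/sqrt(steepness * (|x-min| + |x-max| - (max-min)) + 1), is assumed to match the one documented for SweetSpotSimilarity's length norm, and the 0.5 steepness is likewise an assumption; check both against the javadocs for your Lucene version. The Python names are hypothetical, and 4 and 10 come from the product-name example above.

```python
import math

def sweetspot_length_norm(num_terms, ln_min=4, ln_max=10, steepness=0.5):
    """Norm is 1.0 inside [ln_min, ln_max] and falls off outside it."""
    # distance is 0 anywhere on the plateau, and grows by 2 per term outside it
    distance = abs(num_terms - ln_min) + abs(num_terms - ln_max) - (ln_max - ln_min)
    return 1.0 / math.sqrt(steepness * distance + 1.0)

if __name__ == "__main__":
    for n in (1, 4, 7, 10, 13, 30):
        print(n, round(sweetspot_length_norm(n), 3))
```

Under these assumptions a one-term name like "ram" and a 13-term keyword-stuffed name get the same reduced norm, while everything from 4 to 10 terms is treated alike.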

likewise we determined that for fields like the "product description" the 
sweetspot for tf matching was around 1-5 (if i remember correctly) ... 
because no one term appeared in a "well written" product description more 
than 5 times -- any more than that was keyword spamming.

every catalog of products is going to be different, and every domain is 
going to be *much* different (eg: if you search books, or encyclopedia 
articles then the sweetspots are going to be much larger)

: Another question is how the tf_hyper_offset parameter might be 
: considered.  It appears to be the inflexion point of the tanh equation, 
: but what term count might a caller consider centering there ( or 

right ... it's the center of your sweetspot if you use hyperbolicTf, you 
use the value that makes sense for your data.
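
A quick way to see the "center" claim: at x = xoffset the tanh term is zero, so the curve sits exactly halfway between min and max whatever the base, and it is symmetric around that point. In the sketch below the offset of 3 is picked arbitrarily as the middle of a 1-5 range like the one mentioned earlier; the function and parameter names are illustrative, not Lucene's.

```python
import math

def hyperbolic_tf(freq, tf_min=0.0, tf_max=2.0, base=1.3, xoffset=10.0):
    """tanh form of the hyperbolic tf curve; xoffset is the inflection point."""
    scaled = math.log(base) * (freq - xoffset)
    return tf_min + (tf_max - tf_min) / 2.0 * (math.tanh(scaled) + 1.0)

if __name__ == "__main__":
    # Moving xoffset slides the whole curve left or right without reshaping it.
    for offset in (3.0, 10.0):
        row = [round(hyperbolic_tf(f, xoffset=offset), 3) for f in (1, 3, 5, 10, 20)]
        print("xoffset", offset, row)
```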

: I also note that the JavaDoc says that the default tf_hyper_base ("the 
: base value to be used in the exponential for the hyperbolic function ") 
: value is e. But checking the code the default is actually 1.3 (less than 
: half e).  Should I file a doc bug?

I'll fix that (if i remember correctly, "e" is the canonical value 
typically used in doing hyperbolics for some reason, but for tf purposes 
it made for a curve that was too steep to be generally useful by default, 
so we changed it as soon as it was committed) ... thanks for pointing out 
the doc mistake.
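
To put a number on "too steep", the sketch below solves the tanh form of the curve for how far past the offset a term count has to be before the score has climbed 90% of the way from min to max. The helper name is hypothetical and the 90% threshold is an arbitrary choice for illustration.

```python
import math

def terms_to_reach(fraction, base):
    """Distance past xoffset at which the curve is `fraction` of the way from min to max."""
    # (tanh(ln(base) * d) + 1) / 2 = fraction  =>  d = atanh(2*fraction - 1) / ln(base)
    return math.atanh(2.0 * fraction - 1.0) / math.log(base)

if __name__ == "__main__":
    for base in (math.e, 1.3):
        print(round(base, 3), round(terms_to_reach(0.9, base), 2))
```

With base e the curve gets there about 1.1 terms past the offset, versus about 4.2 terms with base 1.3, so a base of e behaves almost like a step function at the scale of typical term counts.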


-Hoss


