lucene-java-user mailing list archives

From Chris Hostetter
Subject RE: SweetSpotSimilarity
Date Mon, 05 Mar 2012 19:26:07 GMT

: very small to occasionally very large.  It also might be the case that 
: cover letters and e-mails while short might not be really something to 
: heavily discount.  The lower discount range can be ignored by setting 
: the min of any sweet spot to 1.  Then one starts to wonder if there 
: really is any level area.

I would definitely not suggest using SSS for fields like legal brief text 
or emails, where there is huge variability in the length of the content -- 
I can't think of any context where a "short" email is definitively 
better/worse than a "long" email.  More traditional TF/IDF seems like it 
would make more sense there.
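
For anyone trying to picture the "level area" being asked about, here is a 
rough Python sketch of the length norm formula given in the 
SweetSpotSimilarity javadocs, 
1/sqrt(steepness * (|x-min| + |x-max| - (max-min)) + 1).  It is a plain 
re-implementation for experimenting, not the Lucene API: the function names 
and the 1..100 sweet spot are made up for illustration, and it ignores any 
precision lost when norms are stored in the index.

```python
import math


def sweet_spot_length_norm(num_terms, lo, hi, steepness=0.5):
    """Length norm per the formula in the SweetSpotSimilarity javadocs.

    lo/hi are the (hypothetical) sweet spot bounds, in number of terms.
    """
    # Zero anywhere inside [lo, hi]; twice the distance to the nearest
    # bound once num_terms falls outside that range.
    excess = abs(num_terms - lo) + abs(num_terms - hi) - (hi - lo)
    return 1.0 / math.sqrt(steepness * excess + 1.0)


def classic_length_norm(num_terms):
    """The traditional 1/sqrt(numTerms) length norm, for comparison."""
    return 1.0 / math.sqrt(num_terms)


if __name__ == "__main__":
    # Assumed sweet spot of 1..100 terms: min = 1 removes the lower discount.
    for n in (1, 50, 100, 500, 5000):
        print(n,
              round(sweet_spot_length_norm(n, 1, 100), 4),
              round(classic_length_norm(n), 4))
```

With min set to 1 nothing is discounted for being short: every length up 
to max gets a norm of 1.0, and only longer documents are penalized.  With 
min = max = 1 and steepness 0.5 the formula reduces to the classic 
1/sqrt(numTerms), so in that configuration there is no level area at all.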

: When I get that deep in the code the issue is not simply the shape of 
: the equation, but issues like how tweaking any parameters affects the 
: overall document scores.  For example, consider the comments about 
: "steepness" related to length norm.  They cover (some of) the mathematics 
: of the equation, but until one spends some time with that equation and 
: understands how the pieces fit together, I doubt it jumps out at most 
: folks what larger or smaller values mean for terms and resulting document 
: scores.
: One obvious hard to tease out part of the Similarity API is when each 
: part is called -- the simplest being index time vs. search time -- there 

well ... hopefully the Similarity docs and the docs on Lucene scoring 
have filled in most of those blanks before you drill down into the 
specifics of how SSS works.  If not, then any concrete improvements you can 
suggest would certainly be appreciated...
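
On the question of what larger or smaller "steepness" values mean, a small 
numeric experiment may help.  The sketch below re-implements the length 
norm formula from the SweetSpotSimilarity javadocs in plain Python (it does 
not call Lucene) and evaluates one document length under several steepness 
values.  The helper names, the 1..100 sweet spot, and the 150-term document 
are all assumptions chosen for illustration.

```python
import math


def length_norm(num_terms, lo, hi, steepness):
    """1/sqrt(steepness * (|x-lo| + |x-hi| - (hi-lo)) + 1)."""
    outside = abs(num_terms - lo) + abs(num_terms - hi) - (hi - lo)
    return 1.0 / math.sqrt(steepness * outside + 1.0)


def norms_by_steepness(num_terms, lo, hi, steepness_values):
    """Map each steepness value to the norm it gives this document length."""
    return {s: length_norm(num_terms, lo, hi, s) for s in steepness_values}


if __name__ == "__main__":
    # Hypothetical document 50 terms past the upper edge of a 1..100 plateau.
    for s, norm in norms_by_steepness(150, 1, 100, (0.1, 0.5, 2.0)).items():
        print(f"steepness={s}: norm={norm:.4f}")
```

Inside the sweet spot the steepness has no effect, since the norm is 
always 1.0 there.  Outside it, a larger steepness makes the norm fall off 
faster, so fields that miss the sweet spot contribute less to the document 
score.  The length norm is computed at index time, so trying different 
values against a real index means reindexing.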

