On Mon, Mar 29, 2010 at 10:57 AM, Benjamin Patrick Jung
<bpjung@terreon.de> wrote:
>
> [Examples] Search term > Subset of expected result
> Cinamo~0.5 > Cinema, Cinnamon [works]
> Strawbarr~0.8 > Strawberry [doesn't work]
>
> >
> As far as I understand, the "Edit distance"
> (aka "Levenshtein distance") between "Strawbarr" and "Strawberry"
> is 2 (one replacement and one insertion to transform "Strawbarr" into
> "Strawberry")
>
>
Yes, you are correct. The scaling is a bit strange in my opinion; you can see
it in FuzzyTermsEnum's javadocs (if you look at the code):
Similarity returns a number that is 1.0f or less (including negative
numbers) based on how similar the Term is compared to a target term. It
returns exactly 0.0f when
    editDistance > maximumEditDistance
Otherwise it returns:
    1 - (editDistance / length)
where length is the length of the shortest term (text or target) including a
prefix that are identical and editDistance is the Levenshtein distance for
the two words.
I think other implementations instead tend to use 1 - (editDistance /
length) for scaling, where length is the length of the longest term.

Robert Muir
rcmuir@gmail.com
