lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: Problem / question concerning "Fuzzy Search"
Date Mon, 29 Mar 2010 15:05:19 GMT
On Mon, Mar 29, 2010 at 10:57 AM, Benjamin Patrick Jung
<bpjung@terreon.de> wrote:

>
> [Examples] Search term --> Subset of expected result
>  Cinamo~0.5 --> Cinema, Cinnamon [works]
>  Strawbarr~0.8 --> Strawberry    [doesn't work]
>
> -->
> As far as I understand, the "Edit distance"
> (aka "Levenshtein distance") between "Strawbarr" and "Strawberry"
> is 2 (one replacement and one insertion to transform "Strawbarr" into
> "Strawberry")
>
>
Yes, you are correct. The scaling is a bit strange in my opinion. You can
see it in FuzzyTermsEnum's javadocs (if you look at the code):

Similarity returns a number that is 1.0f or less (including negative
numbers) based on how similar the Term is compared to a target term.  It
returns exactly 0.0f when

    editDistance > maximumEditDistance

Otherwise it returns:

    1 - (editDistance / length)

where length is the length of the shortest term (text or target) including a
prefix that is identical and editDistance is the Levenshtein distance for
the two words.


I think other implementations instead tend to use 1 - (editDistance /
length) for scaling, where length is the length of the longest term.
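
To make the difference concrete, here is a rough sketch that applies both
scalings to the "Strawbarr" / "Strawberry" example. It is not Lucene's code:
the function names (levenshtein, scaled_similarity) are made up for
illustration, the prefix handling is ignored, and it only evaluates the
formula quoted above with the shortest and the longest term length.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming Levenshtein distance (illustrative helper)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def scaled_similarity(text: str, target: str, pick_length) -> float:
    """1 - (editDistance / length); pick_length is min or max.

    Assumption: no shared prefix is treated specially, and the
    maximumEditDistance cutoff from the javadoc is not modelled.
    """
    length = pick_length(len(text), len(target))
    return 1.0 - levenshtein(text, target) / length


if __name__ == "__main__":
    text, target = "Strawbarr", "Strawberry"
    print("edit distance:", levenshtein(text, target))
    print("shortest-term scaling:", scaled_similarity(text, target, min))
    print("longest-term scaling: ", scaled_similarity(text, target, max))
```

With an edit distance of 2, the shortest-term scaling gives 1 - 2/9 (about
0.78), which is below the requested 0.8. The longest-term scaling gives
1 - 2/10, which sits right at 0.8.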

-- 
Robert Muir
rcmuir@gmail.com
