lucene-java-user mailing list archives

From Dawid Weiss <dawid.we...@cs.put.poznan.pl>
Subject PhraseQuery and edit distance slightly confusing.
Date Wed, 15 Mar 2006 17:02:47 GMT

Hi there,

I get the concept implemented in PhraseQuery, but isn't calling it an 
edit distance a little far-fetched? Only the extreme elements (the 
minimum and maximum offsets of the matched terms from their respective 
query positions) are taken into account. Consider this example:

phrase:     a  b  c  d
term pos:   0  1  2  3

document A: a  c  b  d
term pos:   0  1  2  3
pos. diff:  0 -1  1  0

=> slope = (1 - (-1)) = 2

document B: a  c  b  x  d
term pos:   0  1  2  3  4
pos. diff:  0 -1  1  -  1

=> slope = (1 - (-1)) = 2
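
To make the arithmetic above easy to check, here is a minimal sketch in 
Python. It is not Lucene's code, only the max-minus-min calculation as 
described in this message; the function name position_spread is made up 
for illustration, and it assumes every phrase term occurs exactly once 
in the document.

```python
def position_spread(phrase, document):
    """Max minus min of (document position - query position) over the phrase terms.

    Illustrative only: assumes each phrase term appears exactly once in the document.
    """
    # Map each document term to its position.
    doc_pos = {term: i for i, term in enumerate(document)}
    # Offset of each phrase term from its position in the query.
    diffs = [doc_pos[term] - q for q, term in enumerate(phrase)]
    return max(diffs) - min(diffs)


phrase = ["a", "b", "c", "d"]
doc_a = ["a", "c", "b", "d"]
doc_b = ["a", "c", "b", "x", "d"]

print(position_spread(phrase, doc_a))  # 2
print(position_spread(phrase, doc_b))  # 2
```

Both documents come out at 2, because the extra "x" in document B 
shifts "d" by one position without moving the maximum offset past 1.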

That is how it is currently implemented, isn't it? The scores differ 
(see the attached example) only because the "document" lengths differ; 
the phrases themselves are scored identically, even though I believe B 
should be penalized. A simple way to do that would be to include the 
phrase length divided by the matching span length... but I'm guessing 
it's implemented this way for a reason; I just don't know what that 
reason might be ;)
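
A rough sketch of the penalty I have in mind, in Python. This is only 
my suggestion written out, not anything that exists in Lucene; the name 
span_ratio is hypothetical, and it again assumes each phrase term 
occurs exactly once in the document.

```python
def span_ratio(phrase, document):
    """Phrase length divided by the length of the document span covering all phrase terms.

    Illustrative only: assumes each phrase term appears exactly once in the document.
    """
    positions = [document.index(term) for term in phrase]
    # Number of document positions between the first and last matched term, inclusive.
    span_length = max(positions) - min(positions) + 1
    return len(phrase) / span_length


phrase = ["a", "b", "c", "d"]
print(span_ratio(phrase, ["a", "c", "b", "d"]))       # 1.0
print(span_ratio(phrase, ["a", "c", "b", "x", "d"]))  # 0.8
```

With this factor document A keeps its full score while document B is 
scaled down, since its four terms are spread over five positions.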

Dawid




