lucene-general mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: minFuzzyLength in FuzzySuggester behaves differently for English and Russian
Date Mon, 03 Jun 2013 11:02:34 GMT
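Unfortunately, this is a limitation of the current FuzzySuggester
implementation: it computes edits in UTF-8 byte space instead of Unicode
character (code point) space. A Cyrillic letter takes two bytes in UTF-8,
so it counts as two units where an ASCII letter counts as one.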
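
To make the mismatch concrete, here is a small sketch using only the JDK
(no Lucene classes). The class and method names are made up for this
illustration, and the strings are written as \u escapes so the source stays
ASCII-only. It compares the length of a word in code points with its length
in UTF-8 bytes, which is the unit a byte-based threshold would be counting:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Lucene.
final class Utf8VsCodePoints {

    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of Unicode code points (what a user thinks of as letters here).
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String english = "hello";
        // Russian "privet" (hello): six Cyrillic letters.
        String russian = "\u043f\u0440\u0438\u0432\u0435\u0442";

        System.out.println("english: " + codePointLength(english)
                + " code points, " + utf8Length(english) + " UTF-8 bytes");
        System.out.println("russian: " + codePointLength(russian)
                + " code points, " + utf8Length(russian) + " UTF-8 bytes");
    }
}
```

The English word has the same length in both units, while the Russian word
is twice as long in bytes as in letters, so a length threshold counted in
bytes is reached after half as many Russian letters.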

This should be fixable: we'd need to change TokenStreamToAutomaton to
work in Unicode character space, then change FuzzySuggester to follow the
same steps that FuzzyQuery does: do the Levenshtein (LevN) expansion in
Unicode character space, then convert that automaton to UTF-8, then
intersect it with the suggest FST.
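
As a rough illustration of why the expansion has to happen in character
space, the sketch below computes a plain dynamic-programming Levenshtein
distance twice: once over code points and once over UTF-8 bytes. This is
not how Lucene does it (the steps above are automaton-based), and every name
here is hypothetical; it only shows how the edit count differs between the
two spaces. Strings are \u escapes to keep the source ASCII-only.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; a plain DP edit distance, not Lucene's automata.
final class EditSpaceDemo {

    // Classic Levenshtein distance (insert, delete, substitute) over int units.
    static int levenshtein(int[] a, int[] b) {
        int[] prev = new int[b.length + 1];
        int[] curr = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length; i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length; j++) {
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows; prev now holds the row just filled
            prev = curr;
            curr = tmp;
        }
        return prev[b.length];
    }

    // Edit distance where one unit is one Unicode code point.
    static int codePointDistance(String a, String b) {
        return levenshtein(a.codePoints().toArray(), b.codePoints().toArray());
    }

    // Edit distance where one unit is one UTF-8 byte.
    static int utf8Distance(String a, String b) {
        return levenshtein(utf8Units(a), utf8Units(b));
    }

    private static int[] utf8Units(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        int[] units = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            units[i] = bytes[i] & 0xFF; // treat bytes as unsigned values
        }
        return units;
    }

    public static void main(String[] args) {
        String privet = "\u043f\u0440\u0438\u0432\u0435\u0442"; // "privet"
        String privt = "\u043f\u0440\u0438\u0432\u0442";        // one letter dropped
        System.out.println("hello/helo   code points: " + codePointDistance("hello", "helo")
                + ", bytes: " + utf8Distance("hello", "helo"));
        System.out.println("privet/privt code points: " + codePointDistance(privet, privt)
                + ", bytes: " + utf8Distance(privet, privt));
    }
}
```

Dropping one English letter costs one edit in either space, but dropping
one Cyrillic letter costs one edit in code point space and two in UTF-8
byte space, which is consistent with the behavior described in the quoted
message below.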

Could you open an issue for this?  I won't have time to work on this
any time soon, but we should open an issue to discuss it, see if someone
else has time, and iterate. Thanks!

Mike McCandless

http://blog.mikemccandless.com


On Thu, May 30, 2013 at 8:39 AM, Artem Lukanin <ice_lc@mail.ru> wrote:
> BTW, I have to set maxEdits=2 to allow letter transpositions in Russian,
> because there will actually be 2 transpositions among the 4 bytes that
> represent 2 Russian letters in UTF-8.
>
> The worst case is when one field has both Russian and English letters (or
> e.g. numbers), where I have to use minFuzzyLength=6 and maxEdits=2, which
> will work only for Russian words of more than 2 letters and for English
> words of more than 5 letters!
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/minFuzzyLength-in-FuzzySuggester-behaves-differently-for-English-and-Russian-tp4067018p4067026.html
> Sent from the Lucene - General mailing list archive at Nabble.com.
