lucene-general mailing list archives

From: Michael McCandless
Subject: Re: minFuzzyLength in FuzzySuggester behaves differently for English and Russian
Date: Mon, 03 Jun 2013 11:02:34 GMT

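This unfortunately is a limitation of the current FuzzySuggester
implementation: it computes edits in UTF-8 byte space instead of Unicode
character (code point) space. So minFuzzyLength and maxEdits are
effectively counted in bytes, and each Russian letter is 2 bytes in UTF-8.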
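
To make the mismatch concrete, here is a small self-contained Java sketch.
It is not Lucene code, and the class and method names are made up for
illustration. It only compares the length of a key in code points with its
length in UTF-8 bytes, which is the unit the suggester currently works in.

```java
import java.nio.charset.StandardCharsets;

public class KeyLengths {
    /** Number of Unicode code points in s. */
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Number of bytes in the UTF-8 encoding of s. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // "\u043f\u0440\u0438\u0432\u0435\u0442" is the Russian word "privet",
        // written with escapes so the source file encoding does not matter.
        String[] keys = {"hello", "\u043f\u0440\u0438\u0432\u0435\u0442"};
        for (String key : keys) {
            System.out.println(codePointLength(key) + " code points, "
                + utf8Length(key) + " UTF-8 bytes");
        }
    }
}
```

The English key has the same length in both units, while the Russian key
is twice as long in bytes as in code points.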

This should be fixable: we'd need to fix TokenStreamToAutomaton to
work in Unicode character space, then fix FuzzySuggester to do the
same steps that FuzzyQuery does: do the LevN expansion in Unicode
character space, then convert that automaton to UTF-8, then intersect
with the suggest FST.
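
Why the order of those steps matters can be shown without Lucene. The
sketch below is only an illustration under my own assumptions: it uses a
plain dynamic-programming Levenshtein distance (no transpositions, no
automata), and EditSpaces and its methods are hypothetical names. It
measures the same one-letter typo once over code points and once over
UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;

public class EditSpaces {
    /** Plain Levenshtein distance (insert, delete, substitute) over int symbols. */
    static int levenshtein(int[] a, int[] b) {
        int[] prev = new int[b.length + 1];
        int[] cur = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length; i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length; j++) {
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1),
                                  prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length];
    }

    /** Distance measured in Unicode code points. */
    static int codePointDistance(String a, String b) {
        return levenshtein(a.codePoints().toArray(), b.codePoints().toArray());
    }

    /** Distance measured in UTF-8 bytes. */
    static int utf8Distance(String a, String b) {
        return levenshtein(utf8(a), utf8(b));
    }

    private static int[] utf8(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        int[] out = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            out[i] = bytes[i] & 0xFF; // treat each byte as an unsigned symbol
        }
        return out;
    }

    public static void main(String[] args) {
        // Same Russian word with its last letter replaced (U+0442 -> U+0434).
        String a = "\u043f\u0440\u0438\u0432\u0435\u0442";
        String b = "\u043f\u0440\u0438\u0432\u0435\u0434";
        System.out.println("code points: " + codePointDistance(a, b));
        System.out.println("UTF-8 bytes: " + utf8Distance(a, b));
    }
}
```

For this pair both bytes of the replaced letter change, so the typo costs
one edit in code point space and two in byte space. Letters that share a
lead byte can differ in only one byte, so the byte-space cost is not
simply double.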

Could you open an issue for this?  I won't have any time soon to work
on this but we should open an issue to discuss / see if someone else
has time / iterate. Thanks!

Mike McCandless

On Thu, May 30, 2013 at 8:39 AM, Artem Lukanin wrote:
> BTW, I have to set maxEdits=2 to allow letter transpositions in Russian,
> because there will actually be 2 transpositions within the 4 bytes that
> represent 2 Russian letters in UTF-8.
> The worst case is when one field has both Russian and English letters (or
> e.g. numbers), where I have to use minFuzzyLength=6 and maxEdits=2, which
> will work only for Russian words of more than 2 letters and for English
> words of more than 5 letters!
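
The arithmetic in the quoted message can be checked with a short sketch.
It assumes, as the quoted numbers suggest, that a key becomes eligible
for fuzzy matching once its UTF-8 length reaches minFuzzyLength. The
class and the fuzzyAllowed helper are hypothetical and are not part of
the FuzzySuggester API.

```java
import java.nio.charset.StandardCharsets;

public class FuzzyThreshold {
    /**
     * Assumption for illustration: edits are allowed once the key's
     * UTF-8 byte length is at least minFuzzyLength.
     */
    static boolean fuzzyAllowed(String key, int minFuzzyLength) {
        return key.getBytes(StandardCharsets.UTF_8).length >= minFuzzyLength;
    }

    public static void main(String[] args) {
        int minFuzzyLength = 6;
        String[] keys = {
            "\u0434\u0430",       // Russian, 2 letters, 4 bytes
            "\u0434\u043e\u043c", // Russian, 3 letters, 6 bytes
            "hello",              // English, 5 letters, 5 bytes
            "lucene"              // English, 6 letters, 6 bytes
        };
        for (String key : keys) {
            System.out.println(key.length() + " letters -> fuzzy allowed: "
                + fuzzyAllowed(key, minFuzzyLength));
        }
    }
}
```

Under that assumption, minFuzzyLength=6 admits Russian words from 3
letters and English words from 6 letters, which matches the thresholds
described in the quoted message.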
