Date: Wed, 19 Jun 2013 11:56:20 +0000 (UTC)
From: "Artem Lukanin (JIRA)"
To: dev@lucene.apache.org
Subject: [jira] [Updated] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

     [ https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Artem Lukanin updated LUCENE-5030:
----------------------------------

    Attachment: nonlatin_fuzzySuggester.patch

The tests in FuzzySuggesterTest and AnalyzingSuggesterTest now pass, except for AnalyzingSuggesterTest.testRandom (when preserveSep = true). With VERBOSE enabled, I can see that the suggestions are correct, so I suspect there is a bug in the test itself, but I cannot find it. Could you please review?

> FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters
> ------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-5030
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5030
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 4.3
>            Reporter: Artem Lukanin
>         Attachments: nonlatin_fuzzySuggester1.patch, nonlatin_fuzzySuggester2.patch, nonlatin_fuzzySuggester3.patch, nonlatin_fuzzySuggester4.patch, nonlatin_fuzzySuggester.patch, nonlatin_fuzzySuggester.patch
>
>
> There is a limitation in the current FuzzySuggester implementation: it computes edits in UTF-8 space instead of Unicode character (code point) space.
> This should be fixable: we'd need to fix TokenStreamToAutomaton to work in Unicode character space, then fix FuzzySuggester to do the same steps that FuzzyQuery does: do the LevN expansion in Unicode character space, then convert that automaton to UTF-8, then intersect with the suggest FST.
> See the discussion here: http://lucene.472066.n3.nabble.com/minFuzzyLength-in-FuzzySuggester-behaves-differently-for-English-and-Russian-td4067018.html#none
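
To illustrate the limitation described in the issue, here is a minimal standalone sketch that is not taken from the attached patch and does not use any Lucene API. The class EditSpaceDemo and its methods byteDistance and codePointDistance are hypothetical names chosen for this example. It computes a plain Levenshtein distance twice, once over UTF-8 bytes and once over Unicode code points, to show why a single-letter edit can count as two edits for Cyrillic text but only one for English.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Lucene or of the LUCENE-5030 patch.
class EditSpaceDemo {

    // Classic two-row Levenshtein distance over integer sequences.
    static int levenshtein(int[] a, int[] b) {
        int[] prev = new int[b.length + 1];
        int[] curr = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length; i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length; j++) {
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length];
    }

    // Edit distance measured in UTF-8 byte space.
    static int byteDistance(String s, String t) {
        return levenshtein(utf8Units(s), utf8Units(t));
    }

    // Edit distance measured in Unicode code point space.
    static int codePointDistance(String s, String t) {
        return levenshtein(s.codePoints().toArray(), t.codePoints().toArray());
    }

    private static int[] utf8Units(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        int[] units = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            units[i] = bytes[i] & 0xFF;
        }
        return units;
    }

    public static void main(String[] args) {
        // Two Russian three-letter words differing in the middle letter,
        // written with escapes so the source file stays ASCII-only.
        String kot = "\u043a\u043e\u0442";
        String kut = "\u043a\u0443\u0442";
        System.out.println("cat/cut bytes:       " + byteDistance("cat", "cut"));
        System.out.println("cat/cut code points: " + codePointDistance("cat", "cut"));
        System.out.println("Russian bytes:       " + byteDistance(kot, kut));
        System.out.println("Russian code points: " + codePointDistance(kot, kut));
    }
}
```

In the Russian pair the two middle letters encode to two-byte UTF-8 sequences with different lead and trail bytes, so one letter substitution costs two byte edits. That is consistent with the different behaviour for English and Russian that the linked discussion describes.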