From: "Robert Muir (JIRA)"
To: java-dev@lucene.apache.org
Reply-To: java-dev@lucene.apache.org
Date: Fri, 13 Nov 2009 23:45:40 +0000 (UTC)
Subject: [jira] Updated: (LUCENE-1488) multilingual analyzer based on icu
Message-ID: <1091829027.1258155940014.JavaMail.jira@brutus>
In-Reply-To: <977513235.1229009264217.JavaMail.jira@brutus>

     [ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1488:
--------------------------------

    Lucene Fields: [New, Patch Available]
                   (was: [New])
    Fix Version/s: 3.1
         Assignee: Robert Muir
       Issue Type: New Feature  (was: Wish)
          Summary: multilingual analyzer based on icu  (was: issues with standardanalyzer on multilingual text)

setting a fix version, setting a correct description of the issue

> multilingual analyzer based on icu
> ----------------------------------
>
>                 Key: LUCENE-1488
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1488
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: ICUAnalyzer.patch, LUCENE-1488.patch, LUCENE-1488.patch, LUCENE-1488.patch, LUCENE-1488.txt, LUCENE-1488.txt
>
>
> The standard analyzer in Lucene is not exactly Unicode-friendly with regard to breaking text into words, especially for non-alphabetic scripts. This is because it is unaware of the Unicode bounds properties.
> I actually couldn't figure out how the Thai analyzer could possibly be working until I looked at the JFlex rules and saw that the codepoint range for most of the Thai block was added to the alphanum specification. Defining exact codepoint ranges like this for every language could help with the problem, but you'd basically be reimplementing the bounds properties already stated in the Unicode standard.
> In general this kind of behavior is bad in Lucene even for Latin: for instance, the analyzer will break words around accent marks in decomposed form. While most Latin letter + accent combinations have composed forms in Unicode, some do not. (This is also an issue for ASCIIFoldingFilter, I suppose.)
> I've got a partially tested StandardAnalyzer that uses ICU's RuleBasedBreakIterator instead of JFlex. Using this method you can define word boundaries according to the Unicode bounds properties.
> After getting it into good shape I'd be happy to contribute it to contrib, but I wonder if there's a better solution so that out-of-the-box Lucene will be more friendly to non-ASCII text. Unfortunately it seems JFlex does not support the use of properties such as [\p{Word_Break = Extend}], so this is probably the major barrier.
> Thanks,
> Robert

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
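
The decomposed-accent problem described in the issue can be illustrated with a small sketch. This is not the ICU RuleBasedBreakIterator approach from the attached patches; it is a hand-rolled example that uses only the JDK's Character class. The class and method names (MarkAwareTokenizer, naiveTokens, markAwareTokens) are hypothetical, and treating the Mn/Me/Mc general categories as "extending" characters is only a rough approximation of Word_Break = Extend, not the full Unicode word-boundary rules.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Lucene or ICU.
final class MarkAwareTokenizer {

    /** Naive rule: a token is a maximal run of letters or digits. */
    static List<String> naiveTokens(String text) {
        return tokens(text, false);
    }

    /** Same rule, except combining marks extend the token they follow. */
    static List<String> markAwareTokens(String text) {
        return tokens(text, true);
    }

    private static List<String> tokens(String text, boolean keepMarks) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            boolean wordChar = Character.isLetterOrDigit(cp);
            // A mark only extends a token that has already started.
            boolean extend = keepMarks && current.length() > 0 && isCombiningMark(cp);
            if (wordChar || extend) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // Any other character ends the current token.
                out.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            out.add(current.toString());
        }
        return out;
    }

    // Assumption: general categories Mn, Me and Mc stand in for Word_Break = Extend.
    private static boolean isCombiningMark(int cp) {
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
            || type == Character.ENCLOSING_MARK
            || type == Character.COMBINING_SPACING_MARK;
    }

    public static void main(String[] args) {
        // "resume" with both accents written as e + U+0301 (decomposed form).
        String decomposed = "re\u0301sume\u0301 draft";
        System.out.println(naiveTokens(decomposed));     // three tokens: re, sume, draft
        System.out.println(markAwareTokens(decomposed)); // two tokens, accents kept attached
    }
}
```

The naive variant splits at each combining acute accent because U+0301 is neither a letter nor a digit, which is the kind of break the issue describes. The mark-aware variant keeps the accent attached to its base letter, which is roughly what rules built on the Unicode boundary properties would provide without listing codepoint ranges per language.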