lucene-dev mailing list archives

From "Steven Rowe (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1126) Simplify StandardTokenizer JFlex grammar
Date Wed, 03 Sep 2008 19:33:44 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628106#action_12628106
] 

Steven Rowe commented on LUCENE-1126:
-------------------------------------

Yeah, I see this too.

The issue is that the entire Thai range {{\u0e00-\u0e5b}} is included in the unpatched grammar's
{LETTER} definition, which contains the huge range {{\u0100-\u1fff}}, much of which is not
actually letters.  The patched grammar instead substitutes the Unicode 3.0 {{Letter}} general
category (via JFlex's [:letter:]), which excludes some characters in the Thai range: non-spacing
marks, a currency symbol, numerals, etc.
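
To see which characters in the Thai range fall outside the {{Letter}} category, a small Java sketch can walk {{\u0e00-\u0e5b}} and list every assigned code point that java.lang.Character.isLetter() rejects. The class and method names here are made up for illustration and are not part of Lucene or JFlex; the exact set reported depends on the Unicode version of the JVM in use.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Lucene code: surveys a code point range using
// the JVM's own Unicode tables.
final class ThaiRangeCategories {

    /** Returns the assigned code points in [start, end] that are not letters. */
    static List<Integer> definedNonLetters(int start, int end) {
        List<Integer> result = new ArrayList<>();
        for (int cp = start; cp <= end; cp++) {
            if (Character.isDefined(cp) && !Character.isLetter(cp)) {
                result.add(cp);
            }
        }
        return result;
    }

    /** Maps a code point's general category to a short label. */
    static String kind(int cp) {
        switch (Character.getType(cp)) {
            case Character.NON_SPACING_MARK:     return "non-spacing mark";
            case Character.CURRENCY_SYMBOL:      return "currency symbol";
            case Character.DECIMAL_DIGIT_NUMBER: return "decimal digit";
            default:                             return "other";
        }
    }

    public static void main(String[] args) {
        // The Thai range quoted in the comment above.
        for (int cp : definedNonLetters(0x0E00, 0x0E5B)) {
            System.out.println(String.format("U+%04X %s", cp, kind(cp)));
        }
    }
}
```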

ThaiAnalyzer uses ThaiWordFilter, which uses Java's BreakIterator to tokenize the contiguous
text (i.e. without whitespace) provided by StandardTokenizer.
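
For readers unfamiliar with BreakIterator, the following is a minimal sketch of word segmentation with java.text.BreakIterator and a Thai locale. It is not ThaiWordFilter's actual code; the class and method names are invented for illustration, and where the boundaries fall depends on the JDK's break rules, so no particular output is claimed.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch, not ThaiWordFilter itself.
final class ThaiBreakSketch {

    /** Splits text at the word boundaries reported by the JDK for Thai. */
    static List<String> segment(String text) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.forLanguageTag("th"));
        words.setText(text);
        List<String> pieces = new ArrayList<>();
        int start = words.first();
        int end = words.next();
        while (end != BreakIterator.DONE) {
            pieces.add(text.substring(start, end));
            start = end;
            end = words.next();
        }
        return pieces;
    }

    public static void main(String[] args) {
        // A run of Thai text with no whitespace, as a tokenizer might hand it over.
        System.out.println(segment("\u0e17\u0e35\u0e48\u0e19\u0e35\u0e48"));
    }
}
```

The segmenter can only work with the text it receives, so any character dropped upstream is invisible to it.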

The failing test expects to see {{"\u0e17\u0e35\u0e48"}}, but instead gets {{"\u0e17"}}, because
{{\u0e35}} is a non-spacing mark, which the patched StandardTokenizer doesn't pass to ThaiWordFilter.
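
The truncation can be reproduced outside the grammar. The sketch below is a simplification under stated assumptions: it models a token as the longest leading run of accepted code points, with one predicate standing in for the old {{\u0100-\u1fff}} range and another for the {{Letter}} category. The names are hypothetical, and the real grammar has many more rules than this.

```java
import java.util.function.IntPredicate;

// Hypothetical model of "which characters may continue a token";
// not the StandardTokenizer implementation.
final class LetterRunSketch {

    /** Stand-in for the wide range in the unpatched LETTER definition. */
    static final IntPredicate OLD_RANGE = cp -> cp >= 0x0100 && cp <= 0x1FFF;

    /** Stand-in for [:letter:], i.e. Character.isLetter semantics. */
    static final IntPredicate LETTER_CATEGORY = cp -> Character.isLetter(cp);

    /** Returns the longest prefix whose code points are all accepted. */
    static String leadingRun(String text, IntPredicate accepted) {
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (!accepted.test(cp)) {
                break;
            }
            i += Character.charCount(cp);
        }
        return text.substring(0, i);
    }

    public static void main(String[] args) {
        String word = "\u0e17\u0e35\u0e48";
        // Full three-character run under the range-based definition.
        System.out.println(leadingRun(word, OLD_RANGE).length());
        // Stops before U+0E35 under the Letter-only definition.
        System.out.println(leadingRun(word, LETTER_CATEGORY).length());
    }
}
```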

Because of this problem, I guess I'm -1 on applying the patch I provided.

One solution would be to switch from using the {{Letter}} general category to the derived
property {{Alphabetic}}, which includes general category {{Letter}} plus many (though not all)
characters of general category {{Mark}}. (see
Annex C of [the Unicode Regular Expressions Technical Standard|http://www.unicode.org/unicode/reports/tr18/#Compatibility_Properties]
under "alpha" for discussion of this).  The current version of JFlex does not support Unicode
property references in its syntax, though, so simplifying -- and correcting -- the grammar
may have to wait for the next version of JFlex, which will support syntax like {{\p{Alphabetic}}}.
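
As a rough way to compare the two properties today, Java 7 and later expose Character.isAlphabetic() alongside Character.isLetter(). The sketch below uses them as stand-ins for the Unicode properties; it is not JFlex syntax, the class name is invented, and the results depend on the JVM's Unicode version. Not every combining mark is {{Alphabetic}}, so it is worth checking each code point of a failing token rather than assuming the whole token is covered.

```java
// Hypothetical comparison helper; uses only java.lang.Character.
final class LetterVsAlphabetic {

    /** Formats both property checks for one code point. */
    static String compare(int cp) {
        return String.format("U+%04X isLetter=%b isAlphabetic=%b",
                cp, Character.isLetter(cp), Character.isAlphabetic(cp));
    }

    public static void main(String[] args) {
        // The three code points from the failing Thai test.
        String word = "\u0e17\u0e35\u0e48";
        for (int i = 0; i < word.length(); i += Character.charCount(word.codePointAt(i))) {
            System.out.println(compare(word.codePointAt(i)));
        }
    }
}
```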


> Simplify StandardTokenizer JFlex grammar
> ----------------------------------------
>
>                 Key: LUCENE-1126
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1126
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.2
>            Reporter: Steven Rowe
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1126.patch
>
>
> Summary of thread entitled "Fullwidth alphanumeric characters, plus a question on Korean
> ranges" begun by Daniel Noll on java-user, and carried over to java-dev:
> On 01/07/2008 at 5:06 PM, Daniel Noll wrote:
> > I wish the tokeniser could just use Character.isLetter and
> > Character.isDigit instead of having to know all the ranges itself, since
> > the JRE already has all this information.  Character.isLetter does
> > return true for CJK characters though, so the ranges would still come in
> > handy for determining what kind of letter they are.  I don't suppose
> > JFlex has a way to do this...
> The DIGIT macro could be replaced by JFlex's predefined character class [:digit:], which
> has the same semantics as java.lang.Character.isDigit().
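
To illustrate what Character.isDigit() semantics cover, here is a small Java sketch showing that decimal digits from several scripts are accepted, while letters and letter-like numerals are not. The class and method names are invented for this example.

```java
// Hypothetical helper demonstrating Character.isDigit semantics.
final class DigitSemantics {

    /** True if the string is non-empty and every code point is a decimal digit. */
    static boolean allDigits(String s) {
        return !s.isEmpty() && s.codePoints().allMatch(cp -> Character.isDigit(cp));
    }

    public static void main(String[] args) {
        System.out.println(allDigits("2008"));          // ASCII digits
        System.out.println(allDigits("\u0e52\u0e55"));  // Thai digits
        System.out.println(allDigits("\uff11\uff12"));  // fullwidth digits
        System.out.println(allDigits("\u2167"));        // Roman numeral eight, a letter number
    }
}
```
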
> Although JFlex's predefined character class [:letter:] (same semantics as java.lang.Character.isLetter())
> includes CJK characters, there is a way to handle this using JFlex's regex negation syntax
> {{!}}.  From [the JFlex documentation|http://jflex.de/manual.html]:
> bq. [T]he expression that matches everything of {{a}} not matched by {{b}} is !(!{{a}}|{{b}})

> So to exclude CJ characters from the LETTER macro:
> {code}
>     LETTER = ! ( ! [:letter:] | {CJ} )
> {code}
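
The negation idiom is De Morgan's law applied to set difference: !(!a|b) matches exactly what a matches and b does not. A minimal Java sketch with predicates shows the equivalence. The CJ predicate below is a simplified stand-in covering only a few blocks, not the grammar's actual {CJ} macro, and all names are hypothetical.

```java
import java.util.function.IntPredicate;

// Hypothetical predicate model of the JFlex expression !(![:letter:] | {CJ}).
final class LetterMinusCj {

    /** Simplified stand-in for {CJ}: Hiragana, Katakana, CJK Unified Ideographs. */
    static final IntPredicate CJ =
            cp -> (cp >= 0x3040 && cp <= 0x30FF) || (cp >= 0x4E00 && cp <= 0x9FFF);

    /** Stand-in for [:letter:]. */
    static final IntPredicate LETTER = cp -> Character.isLetter(cp);

    /** Builds "a but not b" using only negation and union: !(!a | b). */
    static IntPredicate minus(IntPredicate a, IntPredicate b) {
        return a.negate().or(b).negate();
    }

    static final IntPredicate LETTER_NOT_CJ = minus(LETTER, CJ);

    public static void main(String[] args) {
        System.out.println(LETTER_NOT_CJ.test('a'));     // Latin letter
        System.out.println(LETTER_NOT_CJ.test(0xD55C));  // Hangul syllable
        System.out.println(LETTER_NOT_CJ.test(0x4E2D));  // CJK ideograph
    }
}
```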
>  
> Since [:letter:] includes all of the Korean ranges, there's no reason (AFAICT) to treat
> them separately; unlike Chinese and Japanese characters, which are individually tokenized,
> the Korean characters should participate in the same token boundary rules as all of the other
> letters.
> I looked at some of the differences between Unicode 3.0.0, which Java 1.4.2 supports,
> and Unicode 5.0, the latest version, and there are lots of new and modified letter and digit
> ranges.  This stuff gets tweaked all the time, and I don't think Lucene should be in the business
> of trying to track it, or take a position on which Unicode version users' data should conform
> to.
> Switching to using JFlex's [:letter:] and [:digit:] predefined character classes ties
> (most of) these decisions to the user's choice of JVM version, and this seems much more reasonable
> to me than the status quo.
> I will attach a patch shortly.


