lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1545) Standard analyzer does not correctly tokenize combining character U+0364 COMBINING LATIN SMALL LETTRE E
Date Fri, 12 Jun 2009 11:09:07 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718795#action_12718795 ]

Michael McCandless commented on LUCENE-1545:
--------------------------------------------

bq. but if you want, i'm willing to come up with some minor grammar changes for StandardAnalyzer
that could help things like this.

Is it possible to conditionalize, at runtime, certain parts of a JFlex grammar?  I.e., with
matchVersion (LUCENE-1684) we could preserve back-compat on this issue, but I'm not sure how
to cleanly push that matchVersion (provided at runtime to StandardAnalyzer's ctor) "down" into
the grammar so that, e.g., we're not forced to make a new full copy of the grammar for each fix.
(Though perhaps that's an OK solution, since it'd make it easy to strongly guarantee
back-compat...)
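
To make the runtime option concrete, here is a minimal sketch of choosing a token rule from a version value in the constructor path. It is plain Java, not JFlex, and every name in it (VersionedRules, MatchVersion, LEGACY, FIXED, tokenCharRule) is hypothetical; it only illustrates the shape of the idea and is not how StandardAnalyzer or LUCENE-1684 is implemented.

```java
import java.util.function.IntPredicate;

class VersionedRules {
    /** Hypothetical stand-in for a matchVersion value passed to an analyzer's constructor. */
    enum MatchVersion { LEGACY, FIXED }

    /**
     * Returns the rule deciding whether a code point may appear inside a token.
     * LEGACY keeps the old behavior (letters only); FIXED also accepts
     * non-spacing combining marks such as U+0364.
     */
    static IntPredicate tokenCharRule(MatchVersion version) {
        IntPredicate letters = Character::isLetter;
        switch (version) {
            case LEGACY:
                return letters;
            default:
                return letters.or(
                        cp -> Character.getType(cp) == Character.NON_SPACING_MARK);
        }
    }

    public static void main(String[] args) {
        int combiningE = 0x0364; // COMBINING LATIN SMALL LETTER E
        System.out.println(tokenCharRule(MatchVersion.LEGACY).test(combiningE));
        System.out.println(tokenCharRule(MatchVersion.FIXED).test(combiningE));
    }
}
```

With a generated scanner the rules are compiled into tables, so a switch like this may not map directly onto a grammar; the alternative raised above, one full grammar copy per version, trades duplication for a stronger back-compat guarantee.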

> Standard analyzer does not correctly tokenize combining character U+0364 COMBINING LATIN SMALL LETTRE E
> -------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1545
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1545
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 2.4
>         Environment: Linux x86_64, Sun Java 1.6
>            Reporter: Andreas Hauser
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: AnalyzerTest.java
>
>
> Standard analyzer does not correctly tokenize combining character U+0364 COMBINING LATIN SMALL LETTER E.
> The word "moͤchte" is incorrectly tokenized into "mo" and "chte"; the combining character is lost.
> The expected result is a single token, "moͤchte".
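
The reported split can be reproduced without Lucene. The sketch below is a simplified stand-in, not StandardTokenizer's actual grammar: a hypothetical tokenize helper that treats only letters as token characters breaks the word at U+0364, because Java classifies that code point as a non-spacing mark rather than a letter. Allowing marks to extend a token that has already started keeps the word whole.

```java
import java.util.ArrayList;
import java.util.List;

class CombiningMarkDemo {
    /**
     * Hypothetical helper: collects runs of letters as tokens. When keepMarks
     * is true, a non-spacing mark extends a token that has already started.
     */
    static List<String> tokenize(String text, boolean keepMarks) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            boolean isMark = Character.getType(cp) == Character.NON_SPACING_MARK;
            boolean partOfToken = Character.isLetter(cp)
                    || (keepMarks && isMark && current.length() > 0);
            if (partOfToken) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // Any other code point ends the current token and is dropped.
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String word = "mo\u0364chte"; // 'o' followed by COMBINING LATIN SMALL LETTER E
        System.out.println(tokenize(word, false)); // two tokens: "mo" and "chte"
        System.out.println(tokenize(word, true));  // one token: the whole word
    }
}
```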

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

