[ https://issues.apache.org/jira/browse/LUCENE-3358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079727#comment-13079727
]
Trejkaz commented on LUCENE-3358:
---------------------------------
Thanks for such a fast fix! :D (I will still wait for 3.4 because it will make backwards-compat
much simpler.)
I am aware of the Unicode word-breaking rules and have read the standard through, which is
where I discovered that not breaking Katakana is part of the standard (and why I haven't
filed that as a bug or improvement as well). It is very unfortunate that the Unicode
Consortium somehow ended up with a rule which is, quite frankly, undesirable. When I brought
the change up with Japanese users, they were 100% against that behaviour, so it's a wonder
that the standard got past the Japanese without any objections (I am, of course, assuming
that they actually consulted an expert in the language). But breaking it up in a separate
filter isn't so hard. It's only a single Unicode block with few combining marks, so the logic
is not that difficult, and StandardTokenizer even marks the token as katakana for us.
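
As a rough illustration of the splitting logic described above, here is a minimal sketch in plain Java. It assumes the goal is one piece per base character, with the combining voiced sound marks U+3099 and U+309A kept attached to the character before them. The class and method names are hypothetical and not part of Lucene. Wiring this into an actual TokenFilter that only touches tokens typed as katakana is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: splits a run of katakana into
// one piece per base character, keeping combining marks attached.
class KatakanaSplitSketch {

    // U+3099 and U+309A are the combining dakuten and handakuten.
    static boolean isCombiningSoundMark(int codePoint) {
        return codePoint == 0x3099 || codePoint == 0x309A;
    }

    static List<String> splitKatakana(String token) {
        List<String> pieces = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            // Start a new piece at every base character; a combining mark
            // stays with the piece that precedes it.
            if (current.length() > 0 && !isCombiningSoundMark(cp)) {
                pieces.add(current.toString());
                current.setLength(0);
            }
            current.appendCodePoint(cp);
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            pieces.add(current.toString());
        }
        return pieces;
    }

    public static void main(String[] args) {
        // KA + combining dakuten, RA, SU: expected to come out as three pieces.
        System.out.println(splitKatakana("\u30AB\u3099\u30E9\u30B9"));
    }
}
```

A real filter would presumably also need to decide what to do with the prolonged sound mark (U+30FC) and halfwidth forms, which this sketch treats as ordinary base characters.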
> StandardTokenizer disposes of Hiragana combining mark dakuten instead of attaching it to the character it belongs to
> --------------------------------------------------------------------------------------------------------------------
>
> Key: LUCENE-3358
> URL: https://issues.apache.org/jira/browse/LUCENE-3358
> Project: Lucene - Java
> Issue Type: Bug
> Affects Versions: 3.3
> Reporter: Trejkaz
> Assignee: Robert Muir
> Fix For: 3.4, 4.0
>
> Attachments: LUCENE-3358.patch, LUCENE-3358.patch
>
>
> Lucene 3.3 (possibly 3.1 onwards) exhibits less than great behaviour for tokenising hiragana,
> if combining marks are in use.
> Here's a unit test:
> {code}
> @Test
> public void testHiraganaWithCombiningMarkDakuten() throws Exception
> {
>     // Hiragana 'sa' (U+3055) followed by the combining mark dakuten (U+3099)
>     TokenStream stream = new StandardTokenizer(Version.LUCENE_33, new StringReader("\u3055\u3099"));
>     // Should be kept together.
>     List<String> expectedTokens = Arrays.asList("\u3055\u3099");
>     List<String> actualTokens = new LinkedList<String>();
>     CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
>     while (stream.incrementToken())
>     {
>         actualTokens.add(term.toString());
>     }
>     assertEquals("Wrong tokens", expectedTokens, actualTokens);
> }
> {code}
> This code fails with:
> {noformat}
> java.lang.AssertionError: Wrong tokens expected:<[ざ]> but was:<[さ]>
> {noformat}
> It seems as if the tokeniser is throwing away the combining mark entirely.
> 3.0's behaviour was also undesirable:
> {noformat}
> java.lang.AssertionError: Wrong tokens expected:<[ざ]> but was:<[さ, ゙]>
> {noformat}
> But at least the token was there, so it was possible to write a filter to work around
> the issue.
> Katakana seems to be avoiding this particular problem, because all katakana and combining
> marks found in a single run seem to be lumped into a single token (this is a problem in its
> own right, but I'm not sure if it's really a bug.)
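
For reference, the expected token in the test prints as ざ because さ (U+3055) followed by the combining dakuten (U+3099) is canonically equivalent to the precomposed character U+3056. The sketch below shows that equivalence using the JDK's java.text.Normalizer. The class name is hypothetical, and whether normalising input before tokenising would be an acceptable workaround is an assumption that the issue itself does not discuss.

```java
import java.text.Normalizer;

// Hypothetical illustration, not part of Lucene.
class DakutenNormalizationSketch {

    // NFC composes a base character and a combining mark into the
    // precomposed form where Unicode defines one.
    static String compose(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "\u3055\u3099"; // SA + combining dakuten
        String composed = compose(decomposed);
        System.out.println(composed.length());         // prints 1
        System.out.println(composed.equals("\u3056")); // prints true
    }
}
```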