lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] [Commented] (LUCENE-3358) StandardTokenizer disposes of Hiragana combining mark dakuten instead of attaching it to the character it belongs to
Date Fri, 05 Aug 2011 02:13:27 GMT


Robert Muir commented on LUCENE-3358:

bq. It is very unfortunate that the Unicode Consortium somehow ended up with a rule which is, quite frankly, undesirable.

I'm not concerned about this. While your users may not like it, I think we should stick with
the Standard, for these reasons:
# It's not desirable to deviate from the standard here; anyone can customize the behavior to
do what they want.
# It's not shown that what you say is true; experiments have been done here (see below), and
as a default I would say what is happening here is just fine.
# Splitting this katakana up in some non-standard way leaves me with performance concerns
about long postings lists for common terms.
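One way to customize the behavior as suggested above is to apply Unicode NFC normalization before tokenization, which composes the base character and combining dakuten into a single precomposed code point that the tokenizer then keeps intact. In Lucene this would typically be done with a normalizing char filter ahead of the tokenizer; the minimal sketch below uses only the JDK's java.text.Normalizer to show the effect (the class name is illustrative, not from any Lucene API):

```java
import java.text.Normalizer;

public class DakutenNormalization {
    public static void main(String[] args) {
        // HIRAGANA LETTER SA (U+3055) followed by COMBINING KATAKANA-HIRAGANA
        // VOICED SOUND MARK (U+3099) -- the decomposed form from the bug report.
        String decomposed = "\u3055\u3099";

        // NFC composes the pair into HIRAGANA LETTER ZA (U+3056), a single
        // code point, so the combining mark can no longer be dropped separately.
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);

        System.out.println(composed.length());      // 1
        System.out.println(Integer.toHexString(composed.charAt(0))); // 3056
    }
}
```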

For the Japanese collection (Table 4), it is not clear whether bigram generation should have
been done for both Kanji and Katakana characters (left part) or only for Kanji characters
(right part of Table 4). When using title-only queries, the Okapi model provided the best
mean average precision of 0.2972 (bigram on Kanji only) compared to 0.2873 when
generating bigrams on both Kanji and Katakana. This difference is rather small, and is even
smaller in the opposite direction for long queries (0.3510 vs. 0.3523). Based on these results
we cannot infer that for the Japanese language one indexing procedure is always significantly
better than another.
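For reference, the bigram indexing discussed in the excerpt simply emits every pair of adjacent characters as a term, so postings grow with run length. A minimal, Lucene-independent sketch (class and method names here are illustrative, not from any Lucene API):

```java
import java.util.ArrayList;
import java.util.List;

public class CharacterBigrams {
    // Emit overlapping character bigrams, the indexing unit discussed above.
    // A three-character run such as 日本語 yields the terms 日本 and 本語.
    static List<String> bigrams(String text) {
        List<String> terms = new ArrayList<String>();
        for (int i = 0; i + 1 < text.length(); i++) {
            terms.add(text.substring(i, i + 2));
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("日本語")); // [日本, 本語]
    }
}
```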

> StandardTokenizer disposes of Hiragana combining mark dakuten instead of attaching it
to the character it belongs to
> --------------------------------------------------------------------------------------------------------------------
>                 Key: LUCENE-3358
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.3
>            Reporter: Trejkaz
>            Assignee: Robert Muir
>             Fix For: 3.4, 4.0
>         Attachments: LUCENE-3358.patch, LUCENE-3358.patch
> Lucene 3.3 (possibly 3.1 onwards) exhibits less-than-ideal behaviour when tokenising hiragana
if combining marks are in use.
> Here's a unit test:
> {code}
>     @Test
>     public void testHiraganaWithCombiningMarkDakuten() throws Exception
>     {
>         // Hiragana SA (U+3055) followed by the combining mark dakuten (U+3099)
>         TokenStream stream = new StandardTokenizer(Version.LUCENE_33, new StringReader("\u3055\u3099"));
>         // Should be kept together.
>         List<String> expectedTokens = Arrays.asList("\u3055\u3099");
>         List<String> actualTokens = new LinkedList<String>();
>         CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
>         while (stream.incrementToken())
>         {
>             actualTokens.add(term.toString());
>         }
>         assertEquals("Wrong tokens", expectedTokens, actualTokens);
>     }
> {code}
> This code fails with:
> {noformat}
> java.lang.AssertionError: Wrong tokens expected:<[ざ]> but was:<[さ]>
> {noformat}
> It seems as if the tokeniser is throwing away the combining mark entirely.
> 3.0's behaviour was also undesirable:
> {noformat}
> java.lang.AssertionError: Wrong tokens expected:<[ざ]> but was:<[さ, ゙]>
> {noformat}
> But at least the token was there, so it was possible to write a filter to work around
the issue.
> Katakana seems to avoid this particular problem, because all katakana and combining
marks found in a single run seem to be lumped into a single token (this is a problem in its
own right, but I'm not sure whether it's really a bug.)

This message is automatically generated by JIRA.