Date: Fri, 5 Aug 2011 02:13:27 +0000 (UTC)
From: "Robert Muir (JIRA)"
To: dev@lucene.apache.org
Subject: [jira] [Commented] (LUCENE-3358) StandardTokenizer disposes of Hiragana combining mark dakuten instead of attaching it to the character it belongs to

    [ https://issues.apache.org/jira/browse/LUCENE-3358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079747#comment-13079747 ]

Robert Muir commented on LUCENE-3358:
-------------------------------------

{quote}
It is very unfortunate that the Unicode Consortium somehow ended up with a rule which is, quite frankly, undesirable.
{quote}

I'm not concerned about this. While your users may not like it, I think we should stick by the standard, for these reasons:

# It's not desirable to deviate from the standard here; anyone can customize the behavior to do what they want (see the sketch below).
# It's not shown that what you say is true: experiments have been done here (see below), and I would say that as a default, what is happening here is just fine.
# Splitting this katakana up in some non-standard way leaves me with performance concerns about long postings lists for common terms.

{noformat}
For the Japanese collection (Table 4), it is not clear whether bigram generation should have been
done for both Kanji and Katakana characters (left part) or only for Kanji characters (right part of
Table 4). When using title-only queries, the Okapi model provided the best mean average precision
of 0.2972 (bigram on Kanji only) compared to 0.2873 when generating bigrams on both Kanji and
Katakana. This difference is rather small, and is even smaller in the opposite direction for long
queries (0.3510 vs. 0.3523). Based on these results we cannot infer that for the Japanese language
one indexing procedure is always significantly better than another.
{noformat}

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.111.6738
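As a concrete illustration of point 1, here is a minimal sketch of one possible customization; it is not part of any attached patch and the class name is made up. If the decomposed sequence is the problem, the input can be pre-composed to NFC with java.text.Normalizer before it ever reaches StandardTokenizer, so U+3055 followed by U+3099 arrives as the single precomposed code point U+3056 and the dakuten can never be dropped on its own:

{code}
import java.text.Normalizer;

// Hypothetical helper, not from the patch: compose canonical sequences (NFC)
// before tokenization so the tokenizer never sees a bare combining dakuten.
public class NfcPrecompose {
    public static String precompose(String text) {
        // NFC composes U+3055 (さ) + U+3099 (combining dakuten) into U+3056 (ざ).
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String composed = precompose("\u3055\u3099");
        // Prints "3056": the single precomposed hiragana ZA.
        System.out.println(Integer.toHexString(composed.codePointAt(0)));
    }
}
{code}

This indexes the precomposed form rather than the decomposed pair, so the same normalization would have to be applied at query time.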
> StandardTokenizer disposes of Hiragana combining mark dakuten instead of attaching it to the character it belongs to
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3358
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3358
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.3
>            Reporter: Trejkaz
>            Assignee: Robert Muir
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3358.patch, LUCENE-3358.patch
>
>
> Lucene 3.3 (possibly 3.1 onwards) exhibits less than great behaviour when tokenising hiragana if combining marks are in use.
> Here's a unit test:
> {code}
>     @Test
>     public void testHiraganaWithCombiningMarkDakuten() throws Exception
>     {
>         // Hiragana 'SA' followed by the combining mark dakuten
>         TokenStream stream = new StandardTokenizer(Version.LUCENE_33, new StringReader("\u3055\u3099"));
>         // Should be kept together.
>         List<String> expectedTokens = Arrays.asList("\u3055\u3099");
>         List<String> actualTokens = new LinkedList<String>();
>         CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
>         while (stream.incrementToken())
>         {
>             actualTokens.add(term.toString());
>         }
>         assertEquals("Wrong tokens", expectedTokens, actualTokens);
>     }
> {code}
> This code fails with:
> {noformat}
> java.lang.AssertionError: Wrong tokens expected:<[ざ]> but was:<[さ]>
> {noformat}
> It seems as if the tokeniser is throwing away the combining mark entirely.
> 3.0's behaviour was also undesirable:
> {noformat}
> java.lang.AssertionError: Wrong tokens expected:<[ざ]> but was:<[さ, ゙]>
> {noformat}
> But at least the token was there, so it was possible to write a filter to work around the issue (see the sketch after this description).
> Katakana seems to avoid this particular problem, because all katakana and combining marks found in a single run seem to be lumped into a single token (this is a problem in its own right, but I'm not sure it's really a bug).
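For reference, here is a minimal sketch of the kind of workaround filter the description alludes to for the 3.0-era behaviour, where the combining mark at least survived as its own token. It is not part of either attached patch, the class name is invented, and for brevity it leaves the offsets of merged tokens unadjusted; it simply appends any token consisting purely of combining marks to the token before it, using the 3.x attribute-based TokenStream API.

{code}
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical workaround filter: merge tokens that consist only of combining
// marks back into the preceding token.
public final class CombiningMarkMergeFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private State pending; // lookahead token buffered for the next call

  public CombiningMarkMergeFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    // Start from the buffered lookahead token, if any, otherwise pull a new one.
    if (pending != null) {
      restoreState(pending);
      pending = null;
    } else if (!input.incrementToken()) {
      return false;
    }
    StringBuilder merged = new StringBuilder();
    merged.append(termAtt.buffer(), 0, termAtt.length());
    State base = captureState();
    boolean appended = false;
    // Absorb any immediately following tokens that are purely combining marks.
    while (input.incrementToken()) {
      if (isOnlyCombiningMarks(termAtt.buffer(), termAtt.length())) {
        merged.append(termAtt.buffer(), 0, termAtt.length());
        appended = true;
      } else {
        pending = captureState(); // not a mark: keep it for the next call
        break;
      }
    }
    restoreState(base); // report the base token's attributes (offsets left as-is)
    if (appended) {
      termAtt.setEmpty().append(merged);
    }
    return true;
  }

  private static boolean isOnlyCombiningMarks(char[] buf, int len) {
    for (int i = 0; i < len; i++) {
      int type = Character.getType(buf[i]);
      if (type != Character.NON_SPACING_MARK && type != Character.COMBINING_SPACING_MARK) {
        return false;
      }
    }
    return len > 0;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pending = null;
  }
}
{code}

Under the 3.3 behaviour described above this filter cannot help, since the dakuten never reaches the token stream at all; that is exactly what this issue is about.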