Date: Fri, 5 Aug 2011 02:13:27 +0000 (UTC)
From: "Robert Muir (JIRA)"
To: dev@lucene.apache.org
Subject: [jira] [Commented] (LUCENE-3358) StandardTokenizer disposes of Hiragana combining mark dakuten instead of attaching it to the character it belongs to

    [ https://issues.apache.org/jira/browse/LUCENE-3358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079747#comment-13079747 ]

Robert Muir commented on LUCENE-3358:
-------------------------------------

{quote}
It is very unfortunate that the Unicode Consortium somehow ended up with a rule which is, quite frankly, undesirable.
{quote}

I'm not concerned about this. While your users may not like it, I think we should stick by the standard, for these reasons:

# It's not desirable to deviate from the standard here; anyone can customize the behavior to do what they want (see the sketch below).
# It's not shown that what you say is true: experiments have been done here (see below), and I would say that as a default, what is happening here is just fine.
# Splitting this katakana up in some non-standard way leaves me with performance concerns about long postings lists for common terms.

{noformat}
For the Japanese collection (Table 4), it is not clear whether bigram generation should have been
done for both Kanji and Katakana characters (left part) or only for Kanji characters (right part of
Table 4). When using title-only queries, the Okapi model provided the best mean average precision
of 0.2972 (bigram on Kanji only) compared to 0.2873 when generating bigrams on both Kanji and
Katakana. This difference is rather small, and is even smaller in the opposite direction for long
queries (0.3510 vs. 0.3523). Based on these results we cannot infer that for the Japanese language
one indexing procedure is always significantly better than another.
{noformat}

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.111.6738
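As a concrete illustration of point 1, here is a minimal sketch of one possible customization; it is not part of any attached patch and the class name is made up. If the decomposed sequence is the problem, the input can be pre-composed to NFC with java.text.Normalizer before it ever reaches StandardTokenizer, so U+3055 followed by U+3099 arrives as the single precomposed code point U+3056 and the dakuten can never be dropped on its own:

{code}
import java.text.Normalizer;

// Hypothetical helper, not from the patch: compose canonical sequences (NFC)
// before tokenization so the tokenizer never sees a bare combining dakuten.
public class NfcPrecompose {
    public static String precompose(String text) {
        // NFC composes U+3055 (さ) + U+3099 (combining dakuten) into U+3056 (ざ).
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String composed = precompose("\u3055\u3099");
        // Prints "3056": the single precomposed hiragana ZA.
        System.out.println(Integer.toHexString(composed.codePointAt(0)));
    }
}
{code}

This indexes the precomposed form rather than the decomposed pair, so the same normalization would have to be applied at query time.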
> StandardTokenizer disposes of Hiragana combining mark dakuten instead of attaching it to the character it belongs to
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3358
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3358
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.3
>            Reporter: Trejkaz
>            Assignee: Robert Muir
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3358.patch, LUCENE-3358.patch
>
>
> Lucene 3.3 (possibly 3.1 onwards) exhibits less than great behaviour when tokenising hiragana if combining marks are in use.
> Here's a unit test:
> {code}
>     @Test
>     public void testHiraganaWithCombiningMarkDakuten() throws Exception
>     {
>         // Hiragana 'SA' followed by the combining mark dakuten
>         TokenStream stream = new StandardTokenizer(Version.LUCENE_33, new StringReader("\u3055\u3099"));
>         // Should be kept together.
>         List<String> expectedTokens = Arrays.asList("\u3055\u3099");
>         List<String> actualTokens = new LinkedList<String>();
>         CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
>         while (stream.incrementToken())
>         {
>             actualTokens.add(term.toString());
>         }
>         assertEquals("Wrong tokens", expectedTokens, actualTokens);
>     }
> {code}
> This code fails with:
> {noformat}
> java.lang.AssertionError: Wrong tokens expected:<[ざ]> but was:<[さ]>
> {noformat}
> It seems as if the tokeniser is throwing away the combining mark entirely.
> 3.0's behaviour was also undesirable:
> {noformat}
> java.lang.AssertionError: Wrong tokens expected:<[ざ]> but was:<[さ, ゙]>
> {noformat}
> But at least the token was there, so it was possible to write a filter to work around the issue (see the sketch after this description).
> Katakana seems to avoid this particular problem, because all katakana and combining marks found in a single run seem to be lumped into a single token (this is a problem in its own right, but I'm not sure it's really a bug).
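For reference, here is a minimal sketch of the kind of workaround filter the description alludes to for the 3.0-era behaviour, where the combining mark at least survived as its own token. It is not part of either attached patch, the class name is invented, and for brevity it leaves the offsets of merged tokens unadjusted; it simply appends any token consisting purely of combining marks to the token before it, using the 3.x attribute-based TokenStream API.

{code}
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical workaround filter: merge tokens that consist only of combining
// marks back into the preceding token.
public final class CombiningMarkMergeFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private State pending; // lookahead token buffered for the next call

  public CombiningMarkMergeFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    // Start from the buffered lookahead token, if any, otherwise pull a new one.
    if (pending != null) {
      restoreState(pending);
      pending = null;
    } else if (!input.incrementToken()) {
      return false;
    }
    StringBuilder merged = new StringBuilder();
    merged.append(termAtt.buffer(), 0, termAtt.length());
    State base = captureState();
    boolean appended = false;
    // Absorb any immediately following tokens that are purely combining marks.
    while (input.incrementToken()) {
      if (isOnlyCombiningMarks(termAtt.buffer(), termAtt.length())) {
        merged.append(termAtt.buffer(), 0, termAtt.length());
        appended = true;
      } else {
        pending = captureState(); // not a mark: keep it for the next call
        break;
      }
    }
    restoreState(base); // report the base token's attributes (offsets left as-is)
    if (appended) {
      termAtt.setEmpty().append(merged);
    }
    return true;
  }

  private static boolean isOnlyCombiningMarks(char[] buf, int len) {
    for (int i = 0; i < len; i++) {
      int type = Character.getType(buf[i]);
      if (type != Character.NON_SPACING_MARK && type != Character.COMBINING_SPACING_MARK) {
        return false;
      }
    }
    return len > 0;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pending = null;
  }
}
{code}

Under the 3.3 behaviour described above this filter cannot help, since the dakuten never reaches the token stream at all; that is exactly what this issue is about.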