From: Rik Tamm-Daniels
To: "solr-user@lucene.apache.org"
Subject: Re: ICUTokenizer acting very strangely with oriental characters
Date: Tue, 12 Aug 2014 23:57:07 +0000
References: <53EA8BD1.3060202@elyograg.org>,<53EA9C72.80002@elyograg.org>
In-Reply-To: <53EA9C72.80002@elyograg.org>
Mailing-List: contact solr-user-help@lucene.apache.org; run by ezmlm
Reply-To: solr-user@lucene.apache.org
Content-Type: multipart/alternative; boundary="_000_e84a4e9aafaa4ebabc5d32b0c65422d3CO1PR05MB409namprd05pro_"
MIME-Version: 1.0

--_000_e84a4e9aafaa4ebabc5d32b0c65422d3CO1PR05MB409namprd05pro_
Content-Type: text/plain; charset="iso-2022-jp"
Content-Transfer-Encoding: quoted-printable

mmn jnbbbjb)nkkkk9nooooooon

Sent from my HTC

----- Reply message -----
From: "Shawn Heisey"
To: "solr-user@lucene.apache.org"
Subject: ICUTokenizer acting very strangely with oriental characters
Date: Tue, Aug 12, 2014 19:00

See the original message on this thread for full details. Some additional information:

This happens on version 4.6.1, 4.7.2, and 4.9.0.

Here is a screenshot showing the analysis problem in more detail. The first line you can see is the ICUTokenizer.
https://www.dropbox.com/s/9wbi7lz77ivya9j/ICUTokenizer-wrong-analysis.png

The original field value was:

２０世紀の１００人;ポートレートアーカイブス;政治家・軍人;政治家・指導 =1B$B