Date: Mon, 5 Dec 2016 07:39:58 +0000 (UTC)
From: "peina (JIRA)"
To: dev@lucene.apache.org
Subject: [jira] [Commented] (LUCENE-7509) [smartcn] Some chinese text is not tokenized correctly with Chinese punctuation marks appended

    [ https://issues.apache.org/jira/browse/LUCENE-7509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15721497#comment-15721497 ]

peina commented on LUCENE-7509:
-------------------------------

BTW, is there any chance that
https://issues.apache.org/jira/browse/LUCENE-7508 will be fixed?

> [smartcn] Some chinese text is not tokenized correctly with Chinese punctuation marks appended
> ----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-7509
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7509
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 6.2.1
>         Environment: Mac OS X 10.10
>            Reporter: peina
>              Labels: chinese, tokenization
>
> Some Chinese text is not tokenized correctly when a Chinese punctuation mark is appended.
> For example, 碧绿的眼珠 is tokenized as 碧绿|的|眼珠, which is correct.
> But 碧绿的眼珠， (with a Chinese punctuation mark appended) is tokenized as 碧绿|的|眼|珠，
> A similar problem occurs when numbers are appended to the text,
> e.g.
> 生活报8月4号 --> 生活|报|8|月|4|号
> 生活报 --> 生活报
> Test Sample:
> import java.io.IOException;
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>
> // Class name is illustrative; the original sample showed only the two methods.
> public class SmartcnTokenizeSample {
>   public static void main(String[] args) throws IOException {
>     // The default constructor loads the built-in stopwords.
>     try (Analyzer analyzer = new SmartChineseAnalyzer()) {
>       System.out.println("Sample1=======");
>       printTokens(analyzer, "生活报8月4号");
>       printTokens(analyzer, "生活报");
>       System.out.println("Sample2=======");
>       printTokens(analyzer, "碧绿的眼珠，");
>       printTokens(analyzer, "碧绿的眼珠");
>     }
>   }
>
>   // Prints the sentence, then each token on its own line.
>   private static void printTokens(Analyzer analyzer, String sentence) throws IOException {
>     System.out.println("sentence:" + sentence);
>     try (TokenStream tokens = analyzer.tokenStream("dummyfield", sentence)) {
>       CharTermAttribute termAttr = tokens.addAttribute(CharTermAttribute.class);
>       tokens.reset();
>       while (tokens.incrementToken()) {
>         System.out.println(termAttr.toString());
>       }
>       tokens.end();
>     }
>   }
> }
> Output:
> Sample1=======
> sentence:生活报8月4号
> 生活
> 报
> 8
> 月
> 4
> 号
> sentence:生活报
> 生活报
> Sample2=======
> sentence:碧绿的眼珠，
> 碧绿
> 的
> 眼
> 珠
> sentence:碧绿的眼珠
> 碧绿
> 的
> 眼珠



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)