Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@lucene.apache.org
Date: Mon, 5 Dec 2016 07:39:58 +0000 (UTC)
From: "peina (JIRA)" <jira@apache.org>
To: dev@lucene.apache.org
Message-ID: <JIRA.13013740.1476932980000.436361.1480923598591@Atlassian.JIRA>
In-Reply-To: <JIRA.13013740.1476932980000@Atlassian.JIRA>
References: <JIRA.13013740.1476932980000@Atlassian.JIRA> <JIRA.13013740.1476932980641@arcas>
Subject: [jira] [Commented] (LUCENE-7509) [smartcn] Some chinese text is not
 tokenized correctly with Chinese punctuation marks appended
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
archived-at: Mon, 05 Dec 2016 07:40:00 -0000


    [ https://issues.apache.org/jira/browse/LUCENE-7509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15721497#comment-15721497 ]

peina commented on LUCENE-7509:
-------------------------------

BTW, is there any chance that https://issues.apache.org/jira/browse/LUCENE-7508 will be fixed?

> [smartcn] Some chinese text is not tokenized correctly with Chinese punctuation marks appended
> ----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-7509
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7509
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 6.2.1
>         Environment: Mac OS X 10.10
>            Reporter: peina
>              Labels: chinese, tokenization
>
> Some Chinese text is not tokenized correctly when Chinese punctuation marks are appended.
> e.g.
> 碧绿的眼珠 is tokenized as 碧绿|的|眼珠, which is correct.
> But
> 碧绿的眼珠， (with a Chinese punctuation mark appended) is tokenized as 碧绿|的|眼|珠，
> A similar case happens when text has numbers appended.
> e.g.
> 生活报8月4号 --> 生活|报|8|月|4|号
> 生活报 --> 生活报
> Test Sample:
> import java.io.IOException;
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>
> // Wrapper class added so the sample compiles; the name is arbitrary.
> public class SmartcnSample {
>   public static void main(String[] args) throws IOException {
>     Analyzer analyzer = new SmartChineseAnalyzer(); /* will load stopwords */
>     System.out.println("Sample1=======");
>     String sentence = "生活报8月4号";
>     printTokens(analyzer, sentence);
>     sentence = "生活报";
>     printTokens(analyzer, sentence);
>     System.out.println("Sample2=======");
>
>     sentence = "碧绿的眼珠，";
>     printTokens(analyzer, sentence);
>     sentence = "碧绿的眼珠";
>     printTokens(analyzer, sentence);
>
>     analyzer.close();
>   }
>
>   // Prints each token the analyzer produces for the sentence, one per line.
>   private static void printTokens(Analyzer analyzer, String sentence) throws IOException {
>     System.out.println("sentence:" + sentence);
>     TokenStream tokens = analyzer.tokenStream("dummyfield", sentence);
>     CharTermAttribute termAttr = tokens.addAttribute(CharTermAttribute.class);
>     tokens.reset();
>     while (tokens.incrementToken()) {
>       System.out.println(termAttr.toString());
>     }
>     tokens.end(); // finish the stream before closing it
>     tokens.close();
>   }
> }
> Output:
> Sample1=======
> sentence:生活报8月4号
> 生活
> 报
> 8
> 月
> 4
> 号
> sentence:生活报
> 生活报
> Sample2=======
> sentence:碧绿的眼珠，
> 碧绿
> 的
> 眼
> 珠
> sentence:碧绿的眼珠
> 碧绿
> 的
> 眼珠
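
The expectation in the report can be read as a simple invariant: appending a punctuation mark (or, per the reporter, a number) should not change how the preceding text is segmented. The following is a minimal sketch of that check, not part of the original report. It does not call Lucene; the class name `SegmentationCheck` and the method `isStablePrefix` are hypothetical, and the token lists are copied by hand from the output quoted above. To check a real analyzer, the lists would be filled from its token stream instead.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene or of the original report.
public class SegmentationCheck {

    // Returns true when the tokens of the base text appear unchanged at the
    // start of the tokens produced for the base text plus an appended suffix.
    public static boolean isStablePrefix(List<String> baseTokens, List<String> extendedTokens) {
        if (extendedTokens.size() < baseTokens.size()) {
            return false;
        }
        return extendedTokens.subList(0, baseTokens.size()).equals(baseTokens);
    }

    public static void main(String[] args) {
        // Token lists copied from the output in the report.
        List<String> eyes = Arrays.asList("碧绿", "的", "眼珠");
        List<String> eyesWithComma = Arrays.asList("碧绿", "的", "眼", "珠");
        List<String> paper = Arrays.asList("生活报");
        List<String> paperWithDate = Arrays.asList("生活", "报", "8", "月", "4", "号");

        // Both reported cases break the invariant.
        System.out.println(isStablePrefix(eyes, eyesWithComma));   // false
        System.out.println(isStablePrefix(paper, paperWithDate));  // false
        // A list compared with itself trivially satisfies it.
        System.out.println(isStablePrefix(eyes, eyes));            // true
    }
}
```

Whether the number case should really satisfy this invariant is a judgment call for the segmenter; the sketch only makes the reporter's expectation explicit.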

