lucene-dev mailing list archives

From "peina (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-7509) [smartcn] Some chinese text is not tokenized correctly with Chinese punctuation marks appended
Date Mon, 05 Dec 2016 07:39:58 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-7509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15721497#comment-15721497 ]

peina commented on LUCENE-7509:
-------------------------------

BTW, is there any chance that https://issues.apache.org/jira/browse/LUCENE-7508 will be fixed?

> [smartcn] Some chinese text is not tokenized correctly with Chinese punctuation marks appended
> ----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-7509
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7509
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 6.2.1
>         Environment: Mac OS X 10.10
>            Reporter: peina
>              Labels: chinese, tokenization
>
> Some Chinese text is not tokenized correctly when a Chinese punctuation mark is appended.
> For example, 碧绿的眼珠 is tokenized as 碧绿|的|眼珠, which is correct.
> But 碧绿的眼珠, (with a Chinese punctuation mark appended) is tokenized as 碧绿|的|眼|珠.
> A similar problem occurs when numbers are appended to the text, e.g.
> 生活报8月4号 --> 生活|报|8|月|4|号
> 生活报 --> 生活报
> Test Sample:
> import java.io.IOException;
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>
> // The class name is arbitrary; it only makes the two methods compilable.
> public class SmartcnTokenizeSample {
>   public static void main(String[] args) throws IOException {
>     // The default constructor loads the built-in stopword list.
>     try (Analyzer analyzer = new SmartChineseAnalyzer()) {
>       System.out.println("Sample1=======");
>       printTokens(analyzer, "生活报8月4号");
>       printTokens(analyzer, "生活报");
>
>       System.out.println("Sample2=======");
>       printTokens(analyzer, "碧绿的眼珠,");
>       printTokens(analyzer, "碧绿的眼珠");
>     }
>   }
>
>   // Prints the sentence, then one token per line.
>   private static void printTokens(Analyzer analyzer, String sentence) throws IOException {
>     System.out.println("sentence:" + sentence);
>     try (TokenStream tokens = analyzer.tokenStream("dummyfield", sentence)) {
>       CharTermAttribute termAttr = tokens.addAttribute(CharTermAttribute.class);
>       tokens.reset();
>       while (tokens.incrementToken()) {
>         System.out.println(termAttr.toString());
>       }
>       tokens.end(); // required by the TokenStream contract before close()
>     }
>   }
> }
> Output:
> Sample1=======
> sentence:生活报8月4号
> 生活
> 报
> 8
> 月
> 4
> 号
> sentence:生活报
> 生活报
> Sample2=======
> sentence:碧绿的眼珠,
> 碧绿
> 的
> 眼
> 珠
> sentence:碧绿的眼珠
> 碧绿
> 的
> 眼珠
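
The output above can be turned into a small regression-style check. The sketch below uses no Lucene classes: it hard-codes the token lists from the report and tests whether the tokens of the shorter text survive as a prefix of the tokens of the longer text. The class and method names (PrefixStabilityCheck, isPrefixStable) are made up for illustration, and "prefix stability" is only an assumed reading of what the reporter expects, not a documented guarantee of SmartChineseAnalyzer.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: compares two token lists
// taken from the output quoted in the issue description.
final class PrefixStabilityCheck {

  // Returns true if 'extended' begins with exactly the tokens in 'base'.
  static boolean isPrefixStable(List<String> base, List<String> extended) {
    return extended.size() >= base.size()
        && extended.subList(0, base.size()).equals(base);
  }

  public static void main(String[] args) {
    // Token lists copied from the reported output (Sample2 and Sample1).
    List<String> eyes = Arrays.asList("碧绿", "的", "眼珠");
    List<String> eyesWithComma = Arrays.asList("碧绿", "的", "眼", "珠");
    List<String> paper = Arrays.asList("生活报");
    List<String> paperWithDate = Arrays.asList("生活", "报", "8", "月", "4", "号");

    // Both checks report false for the output shown in the issue.
    System.out.println("punctuation case stable: " + isPrefixStable(eyes, eyesWithComma));
    System.out.println("number case stable: " + isPrefixStable(paper, paperWithDate));
  }
}
```

To use it against a real analyzer, the hard-coded lists would be replaced by tokens collected from the TokenStream loop in the test sample.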



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

