lucene-dev mailing list archives

From "Chang KaiShin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-7509) [smartcn] Some chinese text is not tokenized correctly with Chinese punctuation marks appended
Date Fri, 02 Dec 2016 05:10:58 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-7509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15714052#comment-15714052 ]

Chang KaiShin commented on LUCENE-7509:
---------------------------------------

This is not a bug. The underlying Viterbi algorithm segments Chinese sentences based on the
probability of occurrence of the Chinese characters. Take the sentence "生活报8月4号" as an
example. The character "报" has two meanings here. When it is placed at the end of the
sentence, it means a daily newspaper. When it is joined with other Chinese characters, it
means to report something. So the algorithm segments "报" as an independent word meaning
"to report". In contrast, "生活报" on its own is assumed to be more likely to mean a daily
newspaper. You need to add some words to the dictionary for the algorithm to learn from, so
that you get the result you want.
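
To illustrate the point above, here is a minimal sketch of Viterbi segmentation over a toy bigram model. It is not smartcn's implementation and does not use its dictionary: the class name `ToyViterbiSegmenter`, the vocabulary, and every cost value (a lower cost stands in for a higher probability) are made up for illustration. It only shows how the same characters "生活报" can be segmented differently depending on what follows them.

```java
import java.util.*;

public class ToyViterbiSegmenter {
    // Hypothetical vocabulary; NOT smartcn's dictionary.
    static final Set<String> VOCAB = Set.of("生活", "报", "生活报", "8", "月", "4", "号");
    // Hypothetical bigram costs (think negative log probability); lower is more likely.
    static final Map<String, Double> COST = new HashMap<>();
    static final double UNSEEN = 10.0; // assumed cost for any word pair not listed below

    static {
        put("<s>", "生活报", 2); put("生活报", "</s>", 1); // newspaper name ending a sentence
        put("<s>", "生活", 1);   put("生活", "报", 2);
        put("报", "</s>", 6);    put("报", "8", 1);        // "report" followed by a number
        put("生活报", "8", 8);
        put("8", "月", 1); put("月", "4", 1); put("4", "号", 1); put("号", "</s>", 1);
    }

    static void put(String prev, String next, double c) { COST.put(prev + "\t" + next, c); }

    static double cost(String prev, String next) {
        return COST.getOrDefault(prev + "\t" + next, UNSEEN);
    }

    // Returns the lowest-cost segmentation of s into vocabulary words,
    // or an empty list if s cannot be covered by the vocabulary.
    static List<String> segment(String s) {
        int n = s.length();
        // best[i][j]: lowest cost of a path whose last word is s.substring(i, j)
        double[][] best = new double[n + 1][n + 1];
        int[][] back = new int[n + 1][n + 1]; // start index of the previous word, -1 at sentence start
        for (double[] row : best) Arrays.fill(row, Double.POSITIVE_INFINITY);

        for (int j = 1; j <= n; j++) {
            for (int i = 0; i < j; i++) {
                String w = s.substring(i, j);
                if (!VOCAB.contains(w)) continue;
                if (i == 0) { best[0][j] = cost("<s>", w); back[0][j] = -1; continue; }
                for (int k = 0; k < i; k++) {
                    if (best[k][i] == Double.POSITIVE_INFINITY) continue;
                    double c = best[k][i] + cost(s.substring(k, i), w);
                    if (c < best[i][j]) { best[i][j] = c; back[i][j] = k; }
                }
            }
        }

        // Pick the best final word, including the cost of ending the sentence after it.
        int start = -1;
        double bestTotal = Double.POSITIVE_INFINITY;
        for (int i = 0; i < n; i++) {
            if (best[i][n] == Double.POSITIVE_INFINITY) continue;
            double total = best[i][n] + cost(s.substring(i, n), "</s>");
            if (total < bestTotal) { bestTotal = total; start = i; }
        }

        // Walk the back pointers from the last word to the first.
        LinkedList<String> words = new LinkedList<>();
        int end = n;
        while (start >= 0) {
            words.addFirst(s.substring(start, end));
            int prev = back[start][end];
            end = start;
            start = prev;
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(segment("生活报"));        // [生活报]
        System.out.println(segment("生活报8月4号"));  // [生活, 报, 8, 月, 4, 号]
    }
}
```

With these made-up costs, "生活报" alone stays one word because ending the sentence after "生活报" is cheap, while "生活报8月4号" is split because the pair ("报", "8") is much cheaper than ("生活报", "8"). Changing the dictionary entries or their weights changes which path wins.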

> [smartcn] Some chinese text is not tokenized correctly with Chinese punctuation marks appended
> ----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-7509
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7509
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 6.2.1
>         Environment: Mac OS X 10.10
>            Reporter: peina
>              Labels: chinese, tokenization
>
> Some Chinese text is not tokenized correctly when Chinese punctuation marks are appended.
> e.g.
> 碧绿的眼珠 is tokenized as 碧绿|的|眼珠, which is correct.
> But 碧绿的眼珠, (with a Chinese punctuation mark appended) is tokenized as 碧绿|的|眼|珠.
> A similar case happens when numbers are appended to the text.
> e.g.
> 生活报8月4号 --> 生活|报|8|月|4|号
> 生活报 --> 生活报
> Test Sample:
> import java.io.IOException;
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>
> // The wrapper class name is illustrative; the report showed only the two methods.
> public class SmartcnTokenizeSample {
>   public static void main(String[] args) throws IOException {
>     // The no-arg constructor loads the default stopwords.
>     try (Analyzer analyzer = new SmartChineseAnalyzer()) {
>       System.out.println("Sample1=======");
>       printTokens(analyzer, "生活报8月4号");
>       printTokens(analyzer, "生活报");
>       System.out.println("Sample2=======");
>       printTokens(analyzer, "碧绿的眼珠,");
>       printTokens(analyzer, "碧绿的眼珠");
>     }
>   }
>
>   // Prints each token the analyzer emits for the sentence, one per line.
>   private static void printTokens(Analyzer analyzer, String sentence) throws IOException {
>     System.out.println("sentence:" + sentence);
>     try (TokenStream tokens = analyzer.tokenStream("dummyfield", sentence)) {
>       CharTermAttribute termAttr = tokens.addAttribute(CharTermAttribute.class);
>       tokens.reset();
>       while (tokens.incrementToken()) {
>         System.out.println(termAttr.toString());
>       }
>       tokens.end(); // finish the stream before it is closed
>     }
>   }
> }
> Output:
> Sample1=======
> sentence:生活报8月4号
> 生活
> 报
> 8
> 月
> 4
> 号
> sentence:生活报
> 生活报
> Sample2=======
> sentence:碧绿的眼珠,
> 碧绿
> 的
> 眼
> 珠
> sentence:碧绿的眼珠
> 碧绿
> 的
> 眼珠



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

