lucene-dev mailing list archives

From "Uwe Schindler (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-8325) smartcn analyzer can't handle SURROGATE char
Date Wed, 23 May 2018 14:47:00 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-8325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16487387#comment-16487387 ]

Uwe Schindler commented on LUCENE-8325:
---------------------------------------

Thanks! Great. :-)

> smartcn analyzer can't handle SURROGATE char
> --------------------------------------------
>
>                 Key: LUCENE-8325
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8325
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: chengpohi
>            Priority: Minor
>              Labels: newbie, patch
>             Fix For: 7.4, master (8.0)
>
>         Attachments: handle_surrogate_char_for_smartcn_2018-05-23.patch
>
>
> This issue was first reported at [https://github.com/elastic/elasticsearch/issues/30739]
> The smartcn analyzer can't handle surrogate chars. Example:
>  
>  
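> {code:java}
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>
> Analyzer ca = new SmartChineseAnalyzer();
> String sentence = "\uD862\uDE0F"; // 𨨏 (U+28A0F), one code point stored as a surrogate pair
> // try-with-resources closes the stream once it has been consumed
> try (TokenStream tokenStream = ca.tokenStream("", sentence)) {
>     CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
>     tokenStream.reset();
>     while (tokenStream.incrementToken()) {
>         String term = charTermAttribute.toString();
>         System.out.println(term);
>     }
>     tokenStream.end();
> }
> {code}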
>  
> The above code snippet will output:
>  
> {code:java}
> ? 
> ? 
> {code}
>  
> I have created a *PATCH* to try to fix this; please help review. Since *smartcn*
> only supports *GBK* chars, the patch just handles a surrogate pair as a *single char*.
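
For readers following along, here is a minimal sketch of why the snippet above prints two "?" lines. It is not the attached patch and does not use any smartcn code; the class and method names (SurrogateSplitSketch, splitByChar, splitByCodePoint) are hypothetical and exist only for illustration. The assumption is that the analyzer walks the input one UTF-16 code unit at a time, so the pair \uD862\uDE0F comes out as two lone surrogates, which a console typically renders as "?". Iterating by code point keeps the pair together.

```java
import java.util.ArrayList;
import java.util.List;

public class SurrogateSplitSketch {

    // Splits per UTF-16 code unit: a supplementary character
    // turns into two separate strings, each holding a lone surrogate.
    static List<String> splitByChar(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            out.add(String.valueOf(s.charAt(i)));
        }
        return out;
    }

    // Splits per Unicode code point: a surrogate pair stays
    // together in a single string.
    static List<String> splitByCodePoint(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.add(new String(Character.toChars(cp)));
            i += Character.charCount(cp); // advances by 2 for a surrogate pair
        }
        return out;
    }

    public static void main(String[] args) {
        String sentence = "\uD862\uDE0F"; // U+28A0F, same input as in the report
        System.out.println(splitByChar(sentence).size());      // prints 2
        System.out.println(splitByCodePoint(sentence).size()); // prints 1
    }
}
```

Whether the real fix iterates by code point or special-cases surrogate pairs in some other way is something only the attached patch can confirm.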




