lucene-dev mailing list archives

From "Jim Ferenczi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-8526) StandardTokenizer doesn't separate hangul characters from other non-CJK chars
Date Fri, 05 Oct 2018 19:56:00 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-8526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640282#comment-16640282 ]

Jim Ferenczi commented on LUCENE-8526:
--------------------------------------

Sounds great [~steve_rowe]. I'll prepare a patch.

> StandardTokenizer doesn't separate hangul characters from other non-CJK chars
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-8526
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8526
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Jim Ferenczi
>            Priority: Minor
>
> It was first reported here: https://github.com/elastic/elasticsearch/issues/34285.
> I don't know if it's the expected behavior, but the StandardTokenizer does not split words
> that are composed of a mix of non-CJK characters and hangul syllables. For instance,
> "한국2018" or "한국abc" is kept as is by this tokenizer and marked as an alpha-numeric group.
> This breaks the CJKBigram token filter, which will not build bigrams on such groups. Other
> CJK characters are correctly split when they are mixed with other alphabets, so I'd expect
> the same for hangul.
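To make the requested behavior concrete: the issue asks that a token like "한국2018" be split at the script boundary between the Hangul syllables and the digits. Below is a minimal, hypothetical sketch of that boundary rule using only the JDK's Character.UnicodeScript; it is not Lucene's actual JFlex-generated StandardTokenizer grammar, and the class and method names are invented for illustration.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (not Lucene's JFlex grammar): break a token wherever
// the text switches between Hangul and non-Hangul code points, which is the
// segmentation the issue asks StandardTokenizer to perform.
public class HangulSplitSketch {

    static boolean isHangul(int codePoint) {
        return UnicodeScript.of(codePoint) == UnicodeScript.HANGUL;
    }

    static List<String> splitOnHangulBoundary(String token) {
        List<String> parts = new ArrayList<>();
        if (token.isEmpty()) {
            return parts;
        }
        int start = 0;
        boolean prevHangul = isHangul(token.codePointAt(0));
        int i = Character.charCount(token.codePointAt(0));
        while (i < token.length()) {
            int cp = token.codePointAt(i);
            boolean curHangul = isHangul(cp);
            if (curHangul != prevHangul) {
                // Script changed: emit the run accumulated so far.
                parts.add(token.substring(start, i));
                start = i;
                prevHangul = curHangul;
            }
            i += Character.charCount(cp);
        }
        parts.add(token.substring(start));
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitOnHangulBoundary("한국2018")); // [한국, 2018]
        System.out.println(splitOnHangulBoundary("한국abc"));  // [한국, abc]
    }
}
```

With this rule applied upstream, CJKBigramFilter would see "한국" as a standalone Hangul run and could form bigrams over it, instead of skipping the mixed alpha-numeric group.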



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

