lucene-java-user mailing list archives

From Jerome Lanneluc <jerome_lanne...@fr.ibm.com>
Subject Chinese analyzer
Date Thu, 24 Jan 2013 14:25:58 GMT
Hi,

I'm using the Chinese analyzer from Lucene 3.6.1. When I tokenize Chinese 
words that contain CJK Unified Ideographs Extension B characters, the 
resulting tokens do not contain the original words. Instead, each 
Extension B character seems to be split into two separate characters.

The attached example produces the following output:

Sentence: 我是中国人(25105 26159 20013 22269 20154)
Tokens: [我(25105) 是(26159) 中国(20013 22269) 人(20154) ]

Sentence: ?(55401 57046)
Tokens: [?(55401) ?(57046) ]

Note that the second sample produces 2 tokens, whereas I would expect a 
single token containing both characters (55401 57046).
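
For reference, the two values in the second sample form a UTF-16 surrogate 
pair: 55401 is 0xD869 (a high surrogate) and 57046 is 0xDED6 (a low 
surrogate), which together encode the single code point U+2A6D6. The 
following is a minimal sketch, using only the JDK, of the difference 
between walking a string char by char and walking it by code point. It is 
not the attached example and it does not call Lucene; the class and method 
names are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of Lucene and not the attached example.
public class SurrogatePairDemo {

    // Splits the text one UTF-16 char at a time.
    // A supplementary character ends up as two separate pieces.
    static List<String> splitByChar(String text) {
        List<String> parts = new ArrayList<String>();
        for (int i = 0; i < text.length(); i++) {
            parts.add(String.valueOf(text.charAt(i)));
        }
        return parts;
    }

    // Splits the text one Unicode code point at a time.
    // A surrogate pair stays together as a single piece.
    static List<String> splitByCodePoint(String text) {
        List<String> parts = new ArrayList<String>();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            parts.add(new String(Character.toChars(codePoint)));
            i += Character.charCount(codePoint);
        }
        return parts;
    }

    public static void main(String[] args) {
        // The two char values reported in the second sample.
        String sample = new String(new char[] { (char) 55401, (char) 57046 });

        System.out.println(Character.isSurrogatePair(sample.charAt(0), sample.charAt(1))); // true
        System.out.println(Integer.toHexString(sample.codePointAt(0)));                    // 2a6d6
        System.out.println(splitByChar(sample).size());                                    // 2
        System.out.println(splitByCodePoint(sample).size());                               // 1
    }
}
```

The char-by-char split yields the same two pieces that show up as tokens 
above, while the code point split keeps the character whole. This only 
illustrates the symptom; it does not establish where in the analyzer the 
split actually happens.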

I could not figure out if I'm doing something wrong, or if this is a bug 
in the Chinese analyzer.

Thanks,
Jerome


