lucene-solr-user mailing list archives

From Erick Erickson <>
Subject Re: Can StandardTokenizerFactory works well for Chinese and English (Bilingual)?
Date Wed, 23 Sep 2015 15:04:52 GMT
In a word, no. The CJK languages in general don't
necessarily tokenize on whitespace, so using a tokenizer
that uses whitespace as its default delimiter simply won't
work well.

Have you tried it? It seems a simple test would get you
an answer faster.
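For reference, a minimal sketch of a Solr schema field type that uses HMMChineseTokenizerFactory instead of the standard tokenizer; the field type name "text_zh" is hypothetical, and this assumes the lucene-analyzers-smartcn jar is on Solr's classpath (e.g. loaded via a <lib> directive in solrconfig.xml):

```xml
<!-- Hypothetical field type for Chinese text; requires the smartcn analyzer jar -->
<fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- HMM-based word segmentation for Simplified Chinese; also passes
         through Latin-script tokens, so mixed Chinese/English text is handled -->
    <tokenizer class="solr.HMMChineseTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Indexing a few sample bilingual documents against a field of this type, and against one using solr.StandardTokenizerFactory, then inspecting the output in the Analysis screen of the Solr admin UI, would show the difference quickly.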


On Wed, Sep 23, 2015 at 7:41 AM, Zheng Lin Edwin Yeo <> wrote:

> Hi,
> Would like to check: will StandardTokenizerFactory work well for indexing
> both English and Chinese (bilingual) documents, or do we need tokenizers
> that are customised for Chinese (e.g. HMMChineseTokenizerFactory)?
> Regards,
> Edwin
