lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Can StandardTokenizerFactory works well for Chinese and English (Bilingual)?
Date Wed, 23 Sep 2015 15:04:52 GMT
In a word, no. The CJK languages in general don't
necessarily tokenize on whitespace, so a tokenizer
that uses whitespace as its default delimiter simply won't
work.
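For reference, a Chinese-aware field type is usually configured with a dedicated tokenizer. Something along these lines (a sketch only; the field type name is made up, and depending on your Solr version HMMChineseTokenizerFactory may require the analysis-extras contrib):

```xml
<!-- Sketch: "text_zh" is an illustrative name, not a shipped config. -->
<fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Segments Chinese text using an HMM model rather than whitespace;
         Latin-script runs in mixed documents are still split into words. -->
    <tokenizer class="solr.HMMChineseTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The analysis screen in the Solr admin UI is a quick way to compare how this and StandardTokenizerFactory break up a sample bilingual string.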

Have you tried it? It seems a simple test would get you
an answer faster.

Best,
Erick

On Wed, Sep 23, 2015 at 7:41 AM, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
wrote:

> Hi,
>
> Would like to check: will StandardTokenizerFactory work well for indexing
> both English and Chinese (bilingual) documents, or do we need tokenizers
> that are customised for Chinese (e.g. HMMChineseTokenizerFactory)?
>
>
> Regards,
> Edwin
>
