lucene-solr-user mailing list archives

From: Otis Gospodnetic <>
Subject: Re: Tokenizing Chinese & multi-language search
Date: Wed, 16 Mar 2011 03:51:13 GMT
Hi Andy,

Is the "I don't know what language the query is in" constraint something you could change, e.g. by:
- asking the user
- deriving it from HTTP request headers
- identifying the query language automatically (if queries are long enough and "texty")
- ...
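If none of those options is available, one crude fallback is to inspect the query string itself: if it contains any CJK codepoints, route it to the CJK-analyzed field, otherwise to the default field. A minimal sketch (the field names text_zh and text_en follow the naming scheme in the quoted message below; this is a heuristic, not the list's official recommendation):

```java
// Heuristic query routing: pick a per-language field based on whether
// the query contains CJK codepoints. A sketch, not a full language
// identifier -- e.g. it cannot tell Chinese from Japanese.
public class QueryFieldRouter {

    // True if any codepoint in the query belongs to a CJK script.
    static boolean containsCjk(String query) {
        return query.codePoints().anyMatch(cp -> {
            Character.UnicodeScript s = Character.UnicodeScript.of(cp);
            return s == Character.UnicodeScript.HAN
                || s == Character.UnicodeScript.HIRAGANA
                || s == Character.UnicodeScript.KATAKANA
                || s == Character.UnicodeScript.HANGUL;
        });
    }

    // Field names are assumptions taken from the thread's naming scheme.
    static String pickField(String query) {
        return containsCjk(query) ? "text_zh" : "text_en";
    }

    public static void main(String[] args) {
        System.out.println(pickField("搜索引擎"));        // prints text_zh
        System.out.println(pickField("full-text search")); // prints text_en
    }
}
```

This only distinguishes "CJK vs. not"; separating text_zh from text_ja, or text_en from text_fr, needs one of the signals listed above (user choice, request headers, or a real language identifier).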

Sematext :: :: Solr - Lucene - Nutch
Lucene ecosystem search ::

----- Original Message ----
> From: Andy <>
> To:
> Sent: Tue, March 15, 2011 9:07:36 PM
> Subject: Tokenizing Chinese & multi-language search
> Hi,
> I remember reading in this list a while ago that Solr will only tokenize on
> whitespace even when using CJKAnalyzer. That would make Solr unusable on
> Chinese or any other language that doesn't use whitespace as a separator.
> 1) I remember reading about a workaround. Unfortunately I can't find the post
> that mentioned it. Could someone give me pointers on how to address this issue?
> 2) Let's say I have fixed this issue and have properly analyzed and indexed
> the Chinese documents. My documents are in multiple languages. I plan to use
> separate fields for documents in different languages: text_en, text_zh,
> text_ja, text_fr, etc. Each field will be associated with the appropriate
> analyzer.
> My problem now is how to deal with the query string. I don't know what
> language the query is in, so I won't be able to select the appropriate
> analyzer for the query string. If I just use the standard analyzer on the
> query string, any query that's in Chinese won't be tokenized correctly. So
> would the whole system still work in this case?
> This must be a pretty common use case, handling multi-language search. What
> is the recommended way of dealing with this problem?
> Thanks.
> Andy
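For reference, the per-language field scheme Andy describes might look roughly like this in schema.xml. This is a sketch under assumptions: the analyzer classes and filter chains shown are illustrative choices, and what is available depends on the Solr/Lucene version in use.

```xml
<!-- Sketch: one field type per language, each with its own analyzer.
     Classes and filter chains are assumptions, not from the thread. -->
<fieldType name="text_zh" class="solr.TextField">
  <!-- CJKAnalyzer emits overlapping character bigrams, so Chinese text
       is tokenized without relying on whitespace. -->
  <analyzer class="org.apache.lucene.analysis.cjk.CJKAnalyzer"/>
</fieldType>

<fieldType name="text_en" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<field name="text_en" type="text_en" indexed="true" stored="true"/>
<field name="text_zh" type="text_zh" indexed="true" stored="true"/>
```

At query time, one way to cope with an unknown query language is to search all of the per-language fields at once with the dismax query parser (e.g. qf=text_en text_zh text_ja text_fr): each field then applies its own query-time analysis, and the best-matching field dominates the score.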
