lucy-user mailing list archives

From Nick Wellnhofer <>
Subject Re: [lucy-user] Chinese support?
Date Sat, 18 Feb 2017 14:46:04 GMT
On 18/02/2017 07:22, Hao Wu wrote:
> Thanks. Got it to work.

Lucy's StandardTokenizer breaks up text at the word boundaries defined in 
Unicode Standard Annex #29. Characters that have the Alphabetic property but 
no Word_Break property are then treated as single terms. These are the 
characters matching \p{Ideographic}, \p{Script: Hiragana}, or \p{Line_Break: 
Complex_Context}. This should work for Chinese, but as Peter mentioned, we 
don't support n-grams.

If you're using QueryParser, though, you're likely to run into problems: 
QueryParser will turn a sequence of Chinese characters into a single 
PhraseQuery, which is obviously wrong. A quick hack is to insert a space 
after every Chinese character before passing the query string to QueryParser:

     # Append a space after each ideographic character:
     $query_string =~ s/\p{Ideographic}/$& /g;
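For anyone not using Perl, a hedged Python analogue of the same hack looks like this; the single CJK Unified Ideographs range is a stand-in for the full \p{Ideographic} property, which Python's built-in re module doesn't expose.

```python
import re

# Approximate \p{Ideographic} with the CJK Unified Ideographs block.
CJK = re.compile(r'([\u4e00-\u9fff])')

def space_out_ideographs(query_string):
    # Insert a space after every ideographic character so the parser
    # sees individual terms rather than one long phrase.
    return CJK.sub(r'\1 ', query_string)

print(space_out_ideographs("中文搜索"))  # '中 文 搜 索 '
```

Note that the substitution leaves a trailing space when the string ends in an ideograph; that is harmless for query parsing.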
