lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: How to do alias(Pinyin) search in Lucene
Date Tue, 15 Dec 2009 13:40:53 GMT
There is an ICU transform token filter in the patch here:
http://issues.apache.org/jira/browse/LUCENE-1488

    Transliterator pinyin = Transliterator.getInstance("Han-Latin");
    Tokenizer tokenizer = new KeywordTokenizer(new StringReader("中国"));
    ICUTransformFilter filter = new ICUTransformFilter(tokenizer, pinyin);
    assertTokenStreamContents(filter, new String[] { "zhōng guó" } );

Note that by default it adds tone marks and inserts a space between syllables.
If you do not want this, you need to do some cleanup:

    Transliterator pinyin = Transliterator.getInstance(
        "Han-Latin; NFD; [[:NonspacingMark:][:Space:]] Remove");
    Tokenizer tokenizer = new KeywordTokenizer(new StringReader("中国"));
    ICUTransformFilter filter = new ICUTransformFilter(tokenizer, pinyin);
    assertTokenStreamContents(filter, new String[] { "zhongguo" } );
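
To make it clearer what the "NFD; [[:NonspacingMark:][:Space:]] Remove" part of the rule does, here is a rough sketch of the same cleanup in plain Java, without ICU. This is only an illustration, not how ICUTransformFilter works internally: the class and method names (PinyinCleanup, stripTonesAndSpaces) are made up for this example, and \s only covers ASCII whitespace, so it approximates [:Space:] rather than matching it exactly.

```java
import java.text.Normalizer;

public class PinyinCleanup {
    // Hypothetical helper: approximates "NFD; [[:NonspacingMark:][:Space:]] Remove".
    static String stripTonesAndSpaces(String pinyin) {
        // NFD splits each accented letter into a base letter plus combining marks.
        String decomposed = Normalizer.normalize(pinyin, Normalizer.Form.NFD);
        // \p{Mn} matches nonspacing marks (the tone marks); \s matches ASCII whitespace.
        return decomposed.replaceAll("[\\p{Mn}\\s]", "");
    }

    public static void main(String[] args) {
        // "zh\u014Dng gu\u00F3" is the tone-marked pinyin from the first test above.
        System.out.println(stripTonesAndSpaces("zh\u014Dng gu\u00F3"));
    }
}
```

The decomposition step matters: without NFD the tone-marked vowels are single precomposed characters, so there would be no separate mark to remove.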


2009/12/15 Weiwei Wang <ww.wang.cs@gmail.com>

> Hi, guys,
>     I'm implementing a search engine based on Lucene for Chinese, so I want
> to support pinyin search as Google China does.
>
> e.g.
>    “中国” means "China" in English
>    this word's pinyin input is "zhongguo"
> The feature I want to implement is: when a user types zhongguo, the results
> will include documents containing "中国" or even "China"
>
> Does anybody here know how to achieve this?
>
> --
> Weiwei Wang
> Alex Wang
> 王巍巍
> Room 403, Mengmin Wei Building
> Computer Science Department
> Gulou Campus of Nanjing University
> Nanjing, P.R.China, 210093
>
> Homepage: http://cs.nju.edu.cn/rl/weiweiwang
>



-- 
Robert Muir
rcmuir@gmail.com
