lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: How to do alias(Pinyin) search in Lucene
Date Wed, 16 Dec 2009 03:35:08 GMT
Hello, this was mainly to show you a quick-and-dirty way to solve the
problem.

If you have a lot of text, here are some ways to optimize:
1. The 'cleanup' step I showed you is an extremely inefficient way to remove
the spaces and diacritics.
For your case, perhaps you can use a more efficient approach that avoids the
normalization (NFD), such as ASCIIFoldingFilter to remove the diacritics.
2. If most of your text is not Chinese, you can use a filter in your spec to
ensure that it only operates on Chinese text... I think something like this
should work: instead of Transliterator.getInstance("Han-Latin") it would be
Transliterator.getInstance("[:Ideographic:] Han-Latin"). Such a filter cannot
be automatically determined (see the comments in the code) because this is a
composite transform: it calls Han-SpacedHan to add the spaces.
3. If you really care about speed, you can take the rules yourself from CLDR
and customize them to your needs (for example, remove diacritics from the
rules and/or do not call Han-SpacedHan, which adds the spaces). This would
also probably automatically fix #2 above. The rules are based on this XML
file: http://unicode.org/repos/cldr/trunk/common/transforms/Han-Latin.xml,
but you need to preprocess it into plain text for it to work as a custom
transform. See here for more details:
http://userguide.icu-project.org/transforms/general
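
As a rough illustration of points 1 and 2 done in plain Java, outside the
transform: the sketch below strips tone marks and spaces from pinyin that
has already been transliterated, and checks whether a string contains any
ideographs so that the transform can be skipped for non-Chinese text. It is
not part of the LUCENE-1488 patch and is not ASCIIFoldingFilter; the class
and method names are made up for the example, it still uses NFD (through
java.text.Normalizer), and it has not been benchmarked.

```java
import java.text.Normalizer;

// Hypothetical helper class, for illustration only.
final class PinyinCleanup {

    // Decomposes accented letters (NFD), then drops the combining marks
    // (tone marks) and any whitespace between syllables.
    static String stripToneMarksAndSpaces(String pinyin) {
        String decomposed = Normalizer.normalize(pinyin, Normalizer.Form.NFD);
        return decomposed.replaceAll("[\\p{Mn}\\s]+", "");
    }

    // True if at least one code point has the Unicode Ideographic property,
    // so callers can skip transliteration for text without Chinese characters.
    static boolean containsIdeograph(String text) {
        return text.codePoints().anyMatch(Character::isIdeographic);
    }

    public static void main(String[] args) {
        // "zh\u014Dng gu\u00F3" is "zhōng guó"; "\u4E2D\u56FD" is "中国".
        System.out.println(stripToneMarksAndSpaces("zh\u014Dng gu\u00F3")); // zhongguo
        System.out.println(containsIdeograph("\u4E2D\u56FD"));              // true
        System.out.println(containsIdeograph("hello"));                     // false
    }
}
```

Whether this is actually faster than the "NFD; [[:NonspacingMark:][:Space:]]
Remove" step inside the transform depends on your data, so measure before
relying on it.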



2009/12/15 Weiwei Wang <ww.wang.cs@gmail.com>

> Finally, I got it to run; however, it is very slow.
>
> 2009/12/15 Weiwei Wang <ww.wang.cs@gmail.com>
>
> > got it, thanks, Robert
> >
> >
> > On Tue, Dec 15, 2009 at 10:19 PM, Robert Muir <rcmuir@gmail.com> wrote:
> >
> >> If you have the Lucene 2.9 or 3.0 source code, just run
> >> patch -p0 < /path/to/LUCENE-XXYY.patch
> >> from the Lucene source code root directory... it should create the
> >> necessary directory and files.
> >> Then run 'ant'; in this case it should create a lucene-icu jar file in
> >> the build directory.
> >>
> >> The patch doesn't include the ICU dependency itself, so you need to get
> >> that jar file from www.icu-project.org and have it in your classpath also.
> >>
> >> Sorry for the trouble, hope to integrate some of this soon for a future
> >> release.
> >>
> >> On Tue, Dec 15, 2009 at 9:13 AM, Weiwei Wang <ww.wang.cs@gmail.com> wrote:
> >>
> >> > Yes, I found the patch file LUCENE-1488.patch, and there's no icu
> >> > directory in my downloaded contrib directory.
> >> >
> >> > I'm new to using patch. I'm currently in the contrib dir; could
> >> > anybody tell me how to run the patch command to generate the relevant
> >> > dir and source files?
> >> >
> >> > On Tue, Dec 15, 2009 at 9:51 PM, Robert Muir <rcmuir@gmail.com> wrote:
> >> >
> >> > > Look at the latest patch file attached to the issue; it should work
> >> > > with Lucene 2.9 or greater (I think).
> >> > >
> >> > > 2009/12/15 Weiwei Wang <ww.wang.cs@gmail.com>
> >> > >
> >> > > > Where can I find the source code?
> >> > > >
> >> > > > On Tue, Dec 15, 2009 at 9:40 PM, Robert Muir <rcmuir@gmail.com> wrote:
> >> > > >
> >> > > > > There is an ICU transform TokenFilter in the patch here:
> >> > > > > http://issues.apache.org/jira/browse/LUCENE-1488
> >> > > > >
> >> > > > >    Transliterator pinyin = Transliterator.getInstance("Han-Latin");
> >> > > > >    Tokenizer tokenizer = new KeywordTokenizer(new StringReader("中国"));
> >> > > > >    ICUTransformFilter filter = new ICUTransformFilter(tokenizer, pinyin);
> >> > > > >    assertTokenStreamContents(filter, new String[] { "zhōng guó" });
> >> > > > >
> >> > > > > Note that by default it adds tone marks and inserts a space between
> >> > > > > syllables. If you do not want this, you need to do some cleanup:
> >> > > > >
> >> > > > >    Transliterator pinyin = Transliterator.getInstance(
> >> > > > >        "Han-Latin; NFD; [[:NonspacingMark:][:Space:]] Remove");
> >> > > > >    Tokenizer tokenizer = new KeywordTokenizer(new StringReader("中国"));
> >> > > > >    ICUTransformFilter filter = new ICUTransformFilter(tokenizer, pinyin);
> >> > > > >    assertTokenStreamContents(filter, new String[] { "zhongguo" });
> >> > > > >
> >> > > > >
> >> > > > > 2009/12/15 Weiwei Wang <ww.wang.cs@gmail.com>
> >> > > > >
> >> > > > > > Hi, guys,
> >> > > > > >     I'm implementing a search engine for Chinese based on Lucene,
> >> > > > > > so I want to support pinyin search as Google China does.
> >> > > > > >
> >> > > > > > e.g.
> >> > > > > >    “中国” means China in English
> >> > > > > >    this word's pinyin input is "zhongguo"
> >> > > > > > The feature I want to implement is: when a user types zhongguo, the
> >> > > > > > results will include documents containing "中国" or even Chinese
> >> > > > > >
> >> > > > > > Does anybody here know how to achieve this?
> >> > > > > >
> >> > > > > > --
> >> > > > > > Weiwei Wang
> >> > > > > > Alex Wang
> >> > > > > > 王巍巍
> >> > > > > > Room 403, Mengmin Wei Building
> >> > > > > > Computer Science Department
> >> > > > > > Gulou Campus of Nanjing University
> >> > > > > > Nanjing, P.R.China, 210093
> >> > > > > >
> >> > > > > > Homepage: http://cs.nju.edu.cn/rl/weiweiwang
> >> > > > > >
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > --
> >> > > > > Robert Muir
> >> > > > > rcmuir@gmail.com
> >> > > > >
> >> > > >
> >> > > >
> >> > > >
> >> > > >
> >> > >
> >> > >
> >> > >
> >> > > --
> >> > > Robert Muir
> >> > > rcmuir@gmail.com
> >> > >
> >> >
> >> >
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Robert Muir
> >> rcmuir@gmail.com
> >>
> >
> >
> >
> >
>
>
>
>



-- 
Robert Muir
rcmuir@gmail.com
