lucene-dev mailing list archives

From "Che Dong" <ched...@hotmail.com>
Subject [contrib]: StandardTokenizer with sigram based CJK Support
Date Tue, 27 Aug 2002 02:56:45 GMT
> Attached is StandardTokenizer.jj with sigram-based East
> Asian language support (each CJK character becomes its own token),
> tested under Windows and GNU/Linux.
> 
> It simply treats different Unicode blocks with different word
> segmentation methods.
> 
> I hope that in future releases we can add more language
> support to StandardTokenizer.jj step by step and keep
> it suitable for most i18n environments.
> Common applications, such as Jive, could use it as the default
> Analyzer
> and use a localized Analyzer for advanced usage.
> 
> Thank you.
> 
> Che, Dong
>  
> diff StandardTokenizer.jj StandardTokenizer.jj.orig 
> 59c59
> <     UNICODE_INPUT = true;
> ---
> >     //UNICODE_INPUT = true;
> 121d120
> < | <SIGRAM: (<CJK>) >
> 130c129
> < | < #LETTER:                                    // alphabets
> ---
> > | < #LETTER:                                    // unicode letters
> 137c136,141
> <         "\u0100"-"\u1fff"
> ---
> >         "\u0100"-"\u1fff",
> >         "\u3040"-"\u318f",
> >         "\u3300"-"\u337f",
> >         "\u3400"-"\u3d2d",
> >         "\u4e00"-"\u9fff",
> >         "\uf900"-"\ufaff"
> 140,148d143
> < |  < #CJK:       // non-alphabets
> <       [
> <        "\u3040"-"\u318f",
> <        "\u3300"-"\u337f",
> <        "\u3400"-"\u3d2d",
> <        "\u4e00"-"\u9fff",
> <        "\uf900"-"\ufaff"
> <       ]
> <     >    
> 
> <         token = <SIGRAM> |
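The effect of the patched grammar can be illustrated with a small standalone Java sketch. It is not the attached StandardTokenizer.jj and does not use any Lucene API; the class name `SigramSketch` and its methods are hypothetical and exist only for illustration. The CJK ranges are copied from the `#CJK` production in the diff above, and everything else (treating any other letter or digit run as one word) is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: not part of Lucene or of the attached patch.
public class SigramSketch {

    // Character ranges copied from the #CJK production in the diff.
    static boolean isCjk(char c) {
        return (c >= '\u3040' && c <= '\u318f')
            || (c >= '\u3300' && c <= '\u337f')
            || (c >= '\u3400' && c <= '\u3d2d')
            || (c >= '\u4e00' && c <= '\u9fff')
            || (c >= '\uf900' && c <= '\ufaff');
    }

    // Emits each CJK character as its own token (a "sigram") and groups
    // runs of other letters or digits into word tokens (an assumption that
    // stands in for the grammar's remaining token rules).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isCjk(c)) {
                flush(word, tokens);
                tokens.add(String.valueOf(c));
            } else if (Character.isLetterOrDigit(c)) {
                word.append(c);
            } else {
                flush(word, tokens); // whitespace or punctuation ends a word
            }
        }
        flush(word, tokens);
        return tokens;
    }

    // Moves the pending word, if any, into the token list.
    private static void flush(StringBuilder word, List<String> tokens) {
        if (word.length() > 0) {
            tokens.add(word.toString());
            word.setLength(0);
        }
    }

    public static void main(String[] args) {
        // Two CJK characters between two Latin words.
        System.out.println(tokenize("Lucene \u4e2d\u6587 test"));
    }
}
```

In this sketch the two CJK characters come out as two separate tokens while "Lucene" and "test" stay whole, which mirrors the idea of moving the CJK ranges out of `#LETTER` and into a separate single-character token.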
> 
> More on Unicode standards:
> 
> http://www.unicode.org/charts/normalization/
> http://www.unicode.org/charts/
> 
> http://octopus.cdut.edu.cn/~yf17/oreilly/langref/appa_01.htm
> http://klomp.org/mark/classpath/html/Character_8java-source.html