lucene-dev mailing list archives

From "Che Dong" <ched...@hotmail.com>
Subject Re: StandardTokenizer CJK Support
Date Sat, 27 Sep 2003 17:57:49 GMT
I think StandardTokenizer could at least support sigram-based (single-character) tokens for CJK text.
I am also trying to implement CJKTokenizer as StandardTokenizer (with sigram support) + BigramFilter.
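
To make the "sigram + BigramFilter" idea concrete, here is a minimal plain-Java sketch. It does not use the Lucene API, and it is not the patch itself. The class and method names (SigramBigramSketch, sigrams, bigrams, isCjk) are hypothetical. Character.isIdeographic stands in for the <CJK> character class, which is an assumption: kana and Hangul, for example, would fall into the alphanumeric branch here.

```java
import java.util.ArrayList;
import java.util.List;

class SigramBigramSketch {

    // Assumption: stand-in for the grammar's <CJK> character class.
    static boolean isCjk(int cp) {
        return Character.isIdeographic(cp);
    }

    // Step 1: letter/digit runs become one token, each CJK character its own token.
    static List<String> sigrams(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isCjk(cp)) {                        // check CJK first: ideographs are letters too
                flush(run, tokens);
                tokens.add(new String(Character.toChars(cp)));
            } else if (Character.isLetterOrDigit(cp)) {
                run.appendCodePoint(cp);
            } else {                                // anything else only separates tokens
                flush(run, tokens);
            }
        }
        flush(run, tokens);
        return tokens;
    }

    private static void flush(StringBuilder run, List<String> tokens) {
        if (run.length() > 0) {
            tokens.add(run.toString());
            run.setLength(0);
        }
    }

    static boolean isSigram(String token) {
        return token.codePointCount(0, token.length()) == 1 && isCjk(token.codePointAt(0));
    }

    // Step 2: merge each run of neighbouring sigrams into overlapping bigrams.
    // Simplification: neighbours in the token list are treated as adjacent in the
    // text, because this sketch does not track character offsets.
    static List<String> bigrams(List<String> tokens) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (!isSigram(tokens.get(i))) {
                out.add(tokens.get(i++));           // non-CJK tokens pass through unchanged
                continue;
            }
            int end = i;
            while (end < tokens.size() && isSigram(tokens.get(end))) {
                end++;
            }
            if (end - i == 1) {
                out.add(tokens.get(i));             // a lone sigram is kept as it is
            } else {
                for (int j = i; j + 1 < end; j++) {
                    out.add(tokens.get(j) + tokens.get(j + 1));
                }
            }
            i = end;
        }
        return out;
    }

    public static void main(String[] args) {
        // "Lucene" followed by four ideographs, written as Unicode escapes
        List<String> single = sigrams("Lucene \u4e2d\u6587\u641c\u7d22");
        System.out.println(single);
        System.out.println(bigrams(single));
    }
}
```

For the sample input, sigrams returns five tokens ("Lucene" plus one token per ideograph), and bigrams turns the four ideographs into three overlapping pairs while passing "Lucene" through unchanged.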

Here is my CJK sigram patch for StandardTokenizer:
57,59c57,59
< //IGNORE_CASE = true;
< //BUILD_PARSER = false;
< //UNICODE_INPUT = true;
---
>     //IGNORE_CASE = true;
>     //BUILD_PARSER = false;
>     UNICODE_INPUT = true;
62c62
< //DEBUG_TOKEN_MANAGER = true;
---
>     //DEBUG_TOKEN_MANAGER = true;
92c92
<   <ALPHANUM: (<LETTER>|<DIGIT>)+ >
---
> <ALPHANUM: (<LETTER>|<DIGIT>)+ >
120a121
> | <SIGRAM: (<CJK>) >
129c130
< | < #LETTER:                                    // unicode letters
---
> | < #LETTER:                                    // alphabets
136c137,141
<        "\u0100"-"\u1fff",
---
>         "\u0100"-"\u1fff"
>     ]
>     >
> |  < #CJK:       // non-alphabets
>       [
166c171
<  <NOISE: ~[] >
---
> <NOISE: ~[] >
184a190
>         token = <SIGRAM> |


Regards 

Che, Dong

----- Original Message ----- 
From: "Erik Hatcher" <erik@ehatchersolutions.com>
To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Sent: Saturday, September 27, 2003 8:38 PM
Subject: Re: StandardTokenizer CJK Support


> Could you add the patch to a Bugzilla issue for easier access?   I 
> don't mind applying it if it has Doug's +1
> 
> Erik
> 
> 
> On Friday, September 26, 2003, at 10:39  AM, danrapp@comcast.net wrote:
> 
> > In August of 2002, Che, Dong suggested a change to 
> > StandardTokenizer.jj that
> > would supply some basic support for CJK. (msgNo:2164) A day later, 
> > Doug gave it
> > +1. The suggested change was not added to CVS nor was there any further
> > discussion on the mailing list.
> >
> > I'm working with an application in which certain fields are mixed 
> > language and
> > this change is very useful. Is there a technical reason why this 
> > change was not
> > made?
> >
> > Regards,
> >
> > --Dan Rapp
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
> 
> 