I think at least the sigram-based token could be supported by StandardTokenizer.
I am also trying to implement CJKTokenizer via StandardTokenizer (with sigram
support) + BigramFilter.

Here is my CJK sigram patch for StandardTokenizer:

57,59c57,59
< //IGNORE_CASE = true;
< //BUILD_PARSER = false;
< //UNICODE_INPUT = true;
---
> //IGNORE_CASE = true;
> //BUILD_PARSER = false;
> UNICODE_INPUT = true;
62c62
< //DEBUG_TOKEN_MANAGER = true;
---
> //DEBUG_TOKEN_MANAGER = true;
92c92
< |)+ >
---
> |)+ >
120a121
> | ) >
129c130
< | < #LETTER: // unicode letters
---
> | < #LETTER: // alphabets
136c137,141
< "\u0100"-"\u1fff",
---
> "\u0100"-"\u1fff"
> ]
> >
>
> | < #CJK: // non-alphabets
> [
166c171
<
---
>
184a190
> token = |

Regards

Che, Dong

----- Original Message -----
From: "Erik Hatcher"
To: "Lucene Developers List"
Sent: Saturday, September 27, 2003 8:38 PM
Subject: Re: StandardTokenizer CJK Support

> Could you add the patch to a Bugzilla issue for easier access? I
> don't mind applying it if it has Doug's +1
>
> Erik
>
>
> On Friday, September 26, 2003, at 10:39 AM, danrapp@comcast.net wrote:
>
> > In August of 2002, Che, Dong suggested a change to StandardTokenizer.jj
> > that would supply some basic support for CJK. (msgNo:2164) A day later,
> > Doug gave it +1. The suggested change was not added to CVS nor was there
> > any further discussion on the mailing list.
> >
> > I'm working with an application in which certain fields are mixed
> > language and this change is very useful. Is there a technical reason
> > why this change was not made?
> >
> > Regards,
> >
> > --Dan Rapp
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
> >
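
To make the "StandardTokenizer (with sigram support) + BigramFilter" idea a
little more concrete, here is a rough sketch of what the bigram step could look
like. It is not the actual BigramFilter and does not use the Lucene TokenStream
API. The class name BigramSketch and the methods isCjkSigram and bigrams are
made up for illustration. It also assumes that consecutive tokens in the list
are adjacent in the original text; a real filter would have to check token
offsets so that characters separated by punctuation are not joined.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: works on plain strings, not Lucene Tokens.
final class BigramSketch {

    // True if the token is exactly one code point in a CJK script.
    static boolean isCjkSigram(String token) {
        if (token.isEmpty() || token.codePointCount(0, token.length()) != 1) {
            return false;
        }
        Character.UnicodeScript script =
            Character.UnicodeScript.of(token.codePointAt(0));
        return script == Character.UnicodeScript.HAN
            || script == Character.UnicodeScript.HIRAGANA
            || script == Character.UnicodeScript.KATAKANA
            || script == Character.UnicodeScript.HANGUL;
    }

    // Joins each run of CJK sigrams into overlapping bigrams.
    // Non-CJK tokens pass through unchanged.
    static List<String> bigrams(List<String> tokens) {
        List<String> out = new ArrayList<String>();
        List<String> run = new ArrayList<String>();
        for (String t : tokens) {
            if (isCjkSigram(t)) {
                run.add(t);
            } else {
                flush(run, out);
                out.add(t);
            }
        }
        flush(run, out);
        return out;
    }

    // Emits the pending run: a lone sigram stays as it is,
    // a run of n >= 2 sigrams becomes n - 1 overlapping bigrams.
    private static void flush(List<String> run, List<String> out) {
        if (run.size() == 1) {
            out.add(run.get(0));
        }
        for (int i = 0; i + 1 < run.size(); i++) {
            out.add(run.get(i) + run.get(i + 1));
        }
        run.clear();
    }

    public static void main(String[] args) {
        // Four Han characters between two Latin-script tokens.
        List<String> in = Arrays.asList(
            "lucene", "\u4e2d", "\u6587", "\u5206", "\u8bcd", "test");
        System.out.println(bigrams(in));
    }
}
```

With this split, the tokenizer grammar only has to emit one token per CJK
character, and the pairing policy lives in a separate step that can be changed
or dropped without touching the grammar.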