lucene-dev mailing list archives

From "Che Dong" <ched...@hotmail.com>
Subject Re: StandardTokenizer CJK Support
Date Sun, 28 Sep 2003 06:51:09 GMT
Attached is the patch with CJK sigram support:


Che, Dong
----- Original Message ----- 
From: "Erik Hatcher" <erik@ehatchersolutions.com>
To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Sent: Sunday, September 28, 2003 6:53 AM
Subject: Re: StandardTokenizer CJK Support


> If Doug or someone else gives a fresh +1 to this I'll apply it if they 
> don't first.  I need patches as file attachments so they don't get 
> mangled in e-mail formatting though - best through Bugzilla so it 
> doesn't get lost in the shuffle again too.
> 
> Erik
> 
> On Saturday, September 27, 2003, at 01:57  PM, Che Dong wrote:
> 
> > I think StandardTokenizer could at least support a sigram 
> > (single-character) base token.
> > I am also trying to implement CJKTokenizer as StandardTokenizer (with 
> > sigram support) + BigramFilter.
> >
> > Here is my CJK sigram patch for StandardTokenizer:
> > 57,59c57,59
> > < //IGNORE_CASE = true;
> > < //BUILD_PARSER = false;
> > < //UNICODE_INPUT = true;
> > ---
> >>     //IGNORE_CASE = true;
> >>     //BUILD_PARSER = false;
> >>     UNICODE_INPUT = true;
> > 62c62
> > < //DEBUG_TOKEN_MANAGER = true;
> > ---
> >>     //DEBUG_TOKEN_MANAGER = true;
> > 92c92
> > <   <ALPHANUM: (<LETTER>|<DIGIT>)+ >
> > ---
> >> <ALPHANUM: (<LETTER>|<DIGIT>)+ >
> > 120a121
> >> | <SIGRAM: (<CJK>) >
> > 129c130
> > < | < #LETTER:                                    // unicode letters
> > ---
> >> | < #LETTER:                                    // alphabets
> > 136c137,141
> > <        "\u0100"-"\u1fff",
> > ---
> >>         "\u0100"-"\u1fff"
> >>     ]
> >>>
> >> |  < #CJK:       // non-alphabets
> >>       [
> > 166c171
> > <  <NOISE: ~[] >
> > ---
> >> <NOISE: ~[] >
> > 184a190
> >>         token = <SIGRAM> |
> >
> >
> > Regards
> >
> > Che, Dong
> >
> > ----- Original Message -----
> > From: "Erik Hatcher" <erik@ehatchersolutions.com>
> > To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
> > Sent: Saturday, September 27, 2003 8:38 PM
> > Subject: Re: StandardTokenizer CJK Support
> >
> >
> >> Could you add the patch to a Bugzilla issue for easier access?   I
> >> don't mind applying it if it has Doug's +1
> >>
> >> Erik
> >>
> >>
> >> On Friday, September 26, 2003, at 10:39  AM, danrapp@comcast.net 
> >> wrote:
> >>
> >>> In August of 2002, Che, Dong suggested a change to
> >>> StandardTokenizer.jj that
> >>> would supply some basic support for CJK. (msgNo:2164) A day later,
> >>> Doug gave it
> >>> +1. The suggested change was not added to CVS nor was there any 
> >>> further
> >>> discussion on the mailing list.
> >>>
> >>> I'm working with an application in which certain fields are mixed
> >>> language and
> >>> this change is very useful. Is there a technical reason why this
> >>> change was not
> >>> made?
> >>>
> >>> Regards,
> >>>
> >>> --Dan Rapp
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> >>> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
> >>
> >>
> >>
> 
> 
> 
> 
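For readers following the "StandardTokenizer (with sigram support) + BigramFilter" idea discussed in this thread, here is a minimal plain-Java sketch of the two stages. It is not the patch and does not use the Lucene API. The class and method names (SigramSketch, tokenize, bigrams) are hypothetical, and isCjk() is only an assumed stand-in for the patch's #CJK character class, whose ranges are not shown in the message.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Lucene code: emits one token per CJK character
// (a "sigram") and then pairs adjacent sigrams into overlapping bigrams.
final class SigramSketch {

    // Assumption: the patch's #CJK ranges are elided in the thread, so this
    // approximates them with Unicode script properties.
    static boolean isCjk(int cp) {
        Character.UnicodeScript s = Character.UnicodeScript.of(cp);
        return s == Character.UnicodeScript.HAN
            || s == Character.UnicodeScript.HIRAGANA
            || s == Character.UnicodeScript.KATAKANA
            || s == Character.UnicodeScript.HANGUL;
    }

    // Appends the pending letter/digit run (if any) as one token.
    private static void flush(StringBuilder run, List<String> tokens) {
        if (run.length() > 0) {
            tokens.add(run.toString());
            run.setLength(0);
        }
    }

    // Stage 1: runs of non-CJK letters/digits become one token (like ALPHANUM);
    // each CJK character becomes its own token (like SIGRAM).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isCjk(cp)) {
                flush(run, tokens);
                tokens.add(new String(Character.toChars(cp)));
            } else if (Character.isLetterOrDigit(cp)) {
                run.appendCodePoint(cp);
            } else {
                flush(run, tokens); // whitespace and punctuation end a run
            }
        }
        flush(run, tokens);
        return tokens;
    }

    static boolean isSigram(String t) {
        return t.codePointCount(0, t.length()) == 1 && isCjk(t.codePointAt(0));
    }

    // Stage 2: replaces each run of two or more consecutive sigrams with its
    // overlapping bigrams; all other tokens pass through unchanged.
    // Simplification: adjacency is judged on the token list only, so sigrams
    // separated by punctuation in the input are still paired. A real filter
    // would need to compare token offsets.
    static List<String> bigrams(List<String> tokens) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            int j = i;
            while (j < tokens.size() && isSigram(tokens.get(j))) {
                j++;
            }
            if (j - i < 2) {
                out.add(tokens.get(i)); // non-sigram token or a lone sigram
                i++;
            } else {
                for (int k = i; k + 1 < j; k++) {
                    out.add(tokens.get(k) + tokens.get(k + 1));
                }
                i = j;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> sigrams = tokenize("Lucene \u4e2d\u6587\u5206\u8bcd");
        System.out.println(sigrams);
        System.out.println(bigrams(sigrams));
    }
}
```

Keeping the tokenizer at the single-character level and doing the pairing in a separate step mirrors the split proposed above: mixed-language fields still get ordinary word tokens, and the bigram step can be added or left out per analyzer.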