lucene-dev mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: StandardTokenizer CJK Support
Date Sat, 27 Sep 2003 22:53:14 GMT
If Doug or someone else gives this a fresh +1, I'll apply it, unless 
they apply it first.  I need patches as file attachments, though, so 
they don't get mangled by e-mail formatting.  Bugzilla is best, so the 
patch doesn't get lost in the shuffle again.

	Erik

On Saturday, September 27, 2003, at 01:57  PM, Che Dong wrote:

> I think at least the sigram-based token could be supported by 
> StandardTokenizer.
> I am also trying to implement CJKTokenizer via StandardTokenizer (with 
> sigram support) + BigramFilter.
>
> Here is my CJK sigram patch for StandardTokenizer:
> 57,59c57,59
> < //IGNORE_CASE = true;
> < //BUILD_PARSER = false;
> < //UNICODE_INPUT = true;
> ---
> >     //IGNORE_CASE = true;
> >     //BUILD_PARSER = false;
> >     UNICODE_INPUT = true;
> 62c62
> < //DEBUG_TOKEN_MANAGER = true;
> ---
> >     //DEBUG_TOKEN_MANAGER = true;
> 92c92
> <   <ALPHANUM: (<LETTER>|<DIGIT>)+ >
> ---
> > <ALPHANUM: (<LETTER>|<DIGIT>)+ >
> 120a121
> > | <SIGRAM: (<CJK>) >
> 129c130
> < | < #LETTER:                                    // unicode letters
> ---
> > | < #LETTER:                                    // alphabets
> 136c137,141
> <        "\u0100"-"\u1fff",
> ---
> >         "\u0100"-"\u1fff"
> >     ]
> > >
> > |  < #CJK:       // non-alphabets
> >       [
> 166c171
> <  <NOISE: ~[] >
> ---
> > <NOISE: ~[] >
> 184a190
> >         token = <SIGRAM> |
>
>
> Regards
>
> Che, Dong
>
> ----- Original Message -----
> From: "Erik Hatcher" <erik@ehatchersolutions.com>
> To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
> Sent: Saturday, September 27, 2003 8:38 PM
> Subject: Re: StandardTokenizer CJK Support
>
>
>> Could you add the patch to a Bugzilla issue for easier access?  I
>> don't mind applying it if it has Doug's +1.
>>
>> Erik
>>
>>
>> On Friday, September 26, 2003, at 10:39 AM, danrapp@comcast.net wrote:
>>
>>> In August of 2002, Che, Dong suggested a change to
>>> StandardTokenizer.jj that would supply some basic support for CJK
>>> (msgNo:2164). A day later, Doug gave it +1. The suggested change was
>>> not added to CVS, nor was there any further discussion on the
>>> mailing list.
>>>
>>> I'm working with an application in which certain fields are mixed
>>> language, and this change is very useful. Is there a technical
>>> reason why this change was not made?
>>>
>>> Regards,
>>>
>>> --Dan Rapp
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
>>> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
>>
>>
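
The approach discussed in this thread (single-character CJK tokens
emitted by the tokenizer, then combined by a bigram filter) can be
sketched in plain Java. This is only a minimal illustration under stated
assumptions; it is not Lucene code and not Che Dong's implementation.
The class CjkBigramSketch, its Token type, and the methods isCjk,
sigrams and bigrams are hypothetical names. The set of Unicode scripts
treated as CJK is also an assumption, because the character ranges of
the #CJK definition are not visible in the diff.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of "sigram tokenizer + bigram filter"; not Lucene code. */
final class CjkBigramSketch {

    /** A token with its character offsets, so adjacency can be checked later. */
    static final class Token {
        final String text;
        final int start;
        final int end;
        final boolean cjk;

        Token(String text, int start, int end, boolean cjk) {
            this.text = text;
            this.start = start;
            this.end = end;
            this.cjk = cjk;
        }
    }

    /** Assumption: Han, Hiragana, Katakana and Hangul count as CJK. */
    static boolean isCjk(int codePoint) {
        Character.UnicodeScript s = Character.UnicodeScript.of(codePoint);
        return s == Character.UnicodeScript.HAN
            || s == Character.UnicodeScript.HIRAGANA
            || s == Character.UnicodeScript.KATAKANA
            || s == Character.UnicodeScript.HANGUL;
    }

    /** Step 1: letter/digit runs become one token; each CJK character is its own token. */
    static List<Token> sigrams(String text) {
        List<Token> tokens = new ArrayList<>();
        int runStart = -1; // start of the current non-CJK run, or -1 if none
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            int next = i + Character.charCount(cp);
            boolean cjk = isCjk(cp);
            // A CJK character or a separator closes any open alphanumeric run.
            if (runStart >= 0 && (cjk || !Character.isLetterOrDigit(cp))) {
                tokens.add(new Token(text.substring(runStart, i), runStart, i, false));
                runStart = -1;
            }
            if (cjk) {
                tokens.add(new Token(text.substring(i, next), i, next, true));
            } else if (Character.isLetterOrDigit(cp) && runStart < 0) {
                runStart = i;
            }
            i = next;
        }
        if (runStart >= 0) {
            tokens.add(new Token(text.substring(runStart), runStart, text.length(), false));
        }
        return tokens;
    }

    /** Step 2: turn each run of directly adjacent CJK sigrams into overlapping bigrams. */
    static List<String> bigrams(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            Token t = tokens.get(i);
            if (!t.cjk) {
                out.add(t.text); // non-CJK tokens pass through unchanged
                i++;
                continue;
            }
            // Extend j to the last CJK token that touches its predecessor.
            int j = i;
            while (j + 1 < tokens.size()
                    && tokens.get(j + 1).cjk
                    && tokens.get(j + 1).start == tokens.get(j).end) {
                j++;
            }
            if (j == i) {
                out.add(t.text); // isolated CJK character stays a single token
            } else {
                for (int k = i; k < j; k++) {
                    out.add(tokens.get(k).text + tokens.get(k + 1).text);
                }
            }
            i = j + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        // "lucene" followed by four Han characters, written as escapes.
        String sample = "lucene \u4e2d\u6587\u641c\u7d22";
        System.out.println(bigrams(sigrams(sample)));
    }
}
```

Keeping offsets on each token lets the bigram step join only characters
that were actually adjacent in the input, so two CJK characters
separated by a space are not merged.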

