lucene-dev mailing list archives

From bugzi...@apache.org
Subject DO NOT REPLY [Bug 27182] - [PATCH] Thai Analysis Enhancement
Date Tue, 24 Feb 2004 13:13:53 GMT
DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://nagoya.apache.org/bugzilla/show_bug.cgi?id=27182>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.

http://nagoya.apache.org/bugzilla/show_bug.cgi?id=27182

[PATCH] Thai Analysis Enhancement

------- Additional Comments From otis@apache.org  2004-02-24 13:13 -------
A note from the original contribution email, for archiving purposes:

Thai is one of those languages written without
whitespace between words. Because of this, Lucene's
StandardTokenizer can't tokenize a Thai sentence and
returns the whole sentence as a single token.

JDK 1.4 ships with a simple dictionary-based tokenizer
for Thai. With a small wrapper, I can use the Thai
BreakIterator to tokenize the Thai sentences returned
by StandardTokenizer.
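For reference, a minimal sketch of what the JDK's dictionary-based Thai support looks like on its own, outside Lucene. `BreakIterator.getWordInstance` is the real `java.text` API; the class and method names (`ThaiBreak`, `words`) are mine, and filtering out letter-free segments is my assumption, since a word-instance BreakIterator reports every boundary, including whitespace and punctuation runs:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ThaiBreak {
    // Split text into word tokens using the locale's word BreakIterator.
    // For the Thai locale the JDK uses a dictionary-based iterator, so
    // this works even though Thai has no spaces between words.
    // Segments containing no letters (whitespace, punctuation) are dropped.
    public static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            String piece = text.substring(start, end);
            if (piece.codePoints().anyMatch(Character::isLetter)) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("hello world", Locale.ENGLISH));
        System.out.println(words("สวัสดีครับ", new Locale("th")));
    }
}
```

Because the Thai segmentation is dictionary-driven, the exact word boundaries depend on the JDK's dictionary, but the segments always concatenate back to the original text.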

My design is quite simple. I added a <THAI> token
type to StandardTokenizer.jj (renamed to
TestStandardTokenizer.jj in my test).
StandardTokenizer then returns each Thai sentence as
a single token of type <THAI>, alongside the ordinary
tokens. BreakIteratorTokenTokenizer then detects such
tokens and breaks them down into smaller tokens, which
represent the actual Thai words.
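The two-stage design above can be sketched in plain Java, without the Lucene TokenStream plumbing the actual patch uses. The `Tok` record and `expand` method are hypothetical stand-ins for Lucene's token classes; only the type-dispatch idea (split tokens tagged `<THAI>`, pass everything else through) and the use of the JDK's Thai `BreakIterator` come from the message:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ThaiSplitSketch {
    // Hypothetical token: text plus the type assigned by the
    // tokenizer grammar (e.g. "<ALPHANUM>", "<THAI>").
    public record Tok(String text, String type) {}

    // Tokens tagged <THAI> carry a whole Thai sentence; break them
    // into word-level tokens with the JDK's dictionary-based Thai
    // BreakIterator. All other tokens pass through unchanged.
    public static List<Tok> expand(Tok t) {
        if (!"<THAI>".equals(t.type())) {
            return List.of(t);
        }
        BreakIterator it = BreakIterator.getWordInstance(new Locale("th"));
        it.setText(t.text());
        List<Tok> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            String piece = t.text().substring(start, end).trim();
            if (!piece.isEmpty()) {
                out.add(new Tok(piece, "<THAI>"));
            }
        }
        return out;
    }
}
```

In the real patch this logic would live in a TokenFilter-style class sitting downstream of StandardTokenizer, so the rest of the analysis chain sees only word-level tokens.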

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

