lucene-dev mailing list archives

From Pichai Ongvasith <>
Subject Thai analyzer
Date Wed, 18 Feb 2004 00:46:04 GMT

I have written some simple adaptor/wrapper classes for
java.text.BreakIterator, available in JDK 1.4 and
later. I also created a ThaiAnalyzer class based on
those wrappers.

Thai is one of those languages that have no whitespace
between words. Because of this, Lucene's
StandardTokenizer can't tokenize a Thai sentence and
instead returns the whole sentence as a single token.

JDK 1.4 comes with a simple dictionary-based tokenizer
for Thai. With the wrappers, I can use the Thai
BreakIterator to tokenize Thai sentences returned from
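
As a rough illustration of the BreakIterator approach (not the code
from this message), the sketch below walks the word boundaries that
java.text.BreakIterator reports and keeps the segments that contain a
letter or digit. The class name BreakIteratorWords and its methods are
hypothetical. Whether Thai text is actually split into words depends on
the Thai break data shipped with the JDK in use.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene or of the code in this message.
class BreakIteratorWords {

    // Splits text at the word boundaries reported for the given locale.
    static List<String> split(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> words = new ArrayList<String>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // Boundaries also delimit spaces and punctuation; skip those.
            if (containsLetterOrDigit(segment)) {
                words.add(segment);
            }
        }
        return words;
    }

    private static boolean containsLetterOrDigit(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isLetterOrDigit(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "phasa thai" (Thai language), written with Unicode escapes.
        String thai = "\u0E20\u0E32\u0E29\u0E32\u0E44\u0E17\u0E22";
        // The segmentation printed here depends on the JDK's Thai locale data.
        System.out.println(split(thai, new Locale("th")));
    }
}
```

Passing the locale as a parameter keeps the helper usable for any
language for which the JDK provides a word BreakIterator.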

My design is quite simple. I added a <THAI> token type to
StandardTokenizer.jj (I renamed it to
TestStandardTokenizer.jj in my test). The
StandardTokenizer then returns each Thai sentence as a
single token tagged <THAI>, among other ordinary tokens.
BreakIteratorTokenTokenizer then detects such a token and
breaks it down further into smaller tokens, which
represent the actual Thai words.
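
The two-stage idea described above can be sketched roughly as follows.
This is an assumption-laden illustration, not the code from this
message and not the Lucene API: Tok is a hypothetical stand-in for a
Lucene token (text, offsets, type), and TypedTokenSplitter is a
hypothetical name. Tokens whose type matches are re-split with a word
BreakIterator, and the offsets of the pieces are shifted so they still
point into the original text.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical second-stage splitter; names are illustrative only.
class TypedTokenSplitter {

    // Stand-in for a token: text, start/end offsets, and a type tag.
    static final class Tok {
        final String text;
        final int start;
        final int end;
        final String type;

        Tok(String text, int start, int end, String type) {
            this.text = text;
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    // Passes ordinary tokens through and re-splits tokens of typeToSplit.
    static List<Tok> expand(List<Tok> input, String typeToSplit, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        List<Tok> out = new ArrayList<Tok>();
        for (Tok t : input) {
            if (!typeToSplit.equals(t.type)) {
                out.add(t);
                continue;
            }
            it.setText(t.text);
            int from = it.first();
            for (int to = it.next(); to != BreakIterator.DONE; from = to, to = it.next()) {
                String piece = t.text.substring(from, to);
                if (piece.trim().isEmpty()) {
                    continue; // skip whitespace-only segments
                }
                // Offsets are relative to the original text, not the token.
                out.add(new Tok(piece, t.start + from, t.start + to, t.type));
            }
        }
        return out;
    }
}
```

For Thai input the call would look like
expand(tokens, "<THAI>", new Locale("th")), with the quality of the
split depending on the JDK's Thai dictionary.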

The source code is available here

I'm not sure whether this code is worth making part of
Lucene. If it is, I can modify the code as you guys
suggest and contribute it to the Lucene project.

