lucene-dev mailing list archives

From "Nattapong Sirilappanich (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-4253) ThaiAnalyzer fail to tokenize word.
Date Fri, 27 Jul 2012 03:24:35 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-4253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423651#comment-13423651 ]

Nattapong Sirilappanich commented on LUCENE-4253:
-------------------------------------------------

Hi Robert,

Stop words are only useful when tokenization is correct.
The problem, as stated in the thesis, is that no current technology can give a 100% correct
tokenization result.

I could try the approach from the thesis, but it would be risky if it doesn't deliver what
the thesis promises.
My preference for now is to use no stop words at all, to avoid potential problems.

One example is the word "คงอยู่" (a two-syllable Thai word meaning to persist or to survive).
It is segmented into "คง" (meaning may, might, or probably in English) and "อยู่"
(meaning stay, live, or reside in English). With the existing stop words, there is no way
to find this word. With the new stop words from the thesis, the term "คง" is the only
way to find the word, which does not make sense. Why should a search for a word meaning "might"
return a result containing a word meaning "survive"?
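
The effect described above can be reproduced with a minimal plain-Java sketch.
It uses no Lucene classes. The class name, the method name and both stop sets
are hypothetical: the sets are inferred from the description above (the existing
list is assumed to contain both syllables, the thesis list only "อยู่") and are
not copied from any real stop word file. The segmentation is hard-coded to the
split described above rather than produced by a real tokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Plain-Java sketch of stop word removal after segmentation; not Lucene code.
class StopWordSketch {

    // Assumed segmenter output for the compound word, as described in the message.
    static final List<String> SEGMENTED = List.of("คง", "อยู่");

    // Hypothetical stop sets, inferred from the message (assumption, not real data).
    static final Set<String> EXISTING_STOP_WORDS = Set.of("คง", "อยู่");
    static final Set<String> THESIS_STOP_WORDS = Set.of("อยู่");

    // Drops every token found in the stop set and keeps the original order.
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println("existing list: " + removeStopWords(SEGMENTED, EXISTING_STOP_WORDS));
        System.out.println("thesis list:   " + removeStopWords(SEGMENTED, THESIS_STOP_WORDS));
        System.out.println("no stop words: " + removeStopWords(SEGMENTED, Set.of()));
    }
}
```

Under these assumptions the existing list leaves no term to index, the thesis
list leaves only "คง", and an empty stop set keeps both syllables, so the word
stays findable by either part.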
                
> ThaiAnalyzer fail to tokenize word.
> -----------------------------------
>
>                 Key: LUCENE-4253
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4253
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: Realtime Branch
>         Environment: Windows 7 SP1.
> Java 1.7.0-b147
>            Reporter: Nattapong Sirilappanich
>
> The method
> protected TokenStreamComponents createComponents(String,Reader)
> returns a component that is unable to tokenize Thai words.
> The current return statement is:
> return new TokenStreamComponents(source, new StopFilter(matchVersion, result, stopwords));
> In my experiment, I changed the return statement to:
> return new TokenStreamComponents(source, result);
> This gives me a correct result.


       


