lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] [Commented] (LUCENE-4253) ThaiAnalyzer fail to tokenize word.
Date Fri, 27 Jul 2012 03:40:35 GMT


Robert Muir commented on LUCENE-4253:

Right, but having less than 100% segmentation isn't unique to Thai (it happens in many other
languages too).

It's always a tradeoff: if those measurements are correct and 30% of typical Thai text is stopwords,
then it's a pretty significant performance (and often relevance) degradation to keep all stopwords.

In general these lists are useful; someone can also choose to use them with the common-grams filter
for maybe an even better tradeoff. That's why I think it's good to keep them (of course as short and minimal
as possible).
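A rough sketch of the common-grams option mentioned above, assuming the Lucene 4.x analysis-common API (`CommonGramsFilter`, `CharArraySet`); the word list here is an illustrative English one, not any shipped stopword list:

```java
import java.util.Arrays;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.commongrams.CommonGramsFilter;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

public class CommonGramsSketch {
    // Instead of dropping stopwords, CommonGramsFilter keeps them and also
    // emits bigram tokens that glue each common word to its neighbor
    // (e.g. "the_quick"), so the terms survive in the index while phrase
    // queries over common words stay cheap.
    static TokenStream withCommonGrams(Version matchVersion, TokenStream input) {
        // Illustrative list; a real setup would pass the language's common words.
        CharArraySet common = new CharArraySet(matchVersion,
                Arrays.asList("the", "a", "of"), /* ignoreCase= */ true);
        return new CommonGramsFilter(matchVersion, input, common);
    }
}
```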

If someone doesn't mind the downsides, you can always pass the CharArraySet.EMPTY_SET parameter,
as I mentioned before.
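For reference, a minimal sketch of that workaround, assuming the Lucene 4.x `ThaiAnalyzer(Version, CharArraySet)` constructor and package layout; the field name and sample text are placeholders:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.th.ThaiAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

public class ThaiNoStopwords {
    public static void main(String[] args) throws Exception {
        // Passing CharArraySet.EMPTY_SET disables stopword removal entirely;
        // every segmented Thai token, including the very frequent ones,
        // reaches the index.
        ThaiAnalyzer analyzer =
                new ThaiAnalyzer(Version.LUCENE_40, CharArraySet.EMPTY_SET);
        TokenStream ts = analyzer.tokenStream("body",
                new StringReader("\u0e17\u0e14\u0e2a\u0e2d\u0e1a")); // sample Thai text
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString());
        }
        ts.end();
        ts.close();
    }
}
```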
> ThaiAnalyzer fail to tokenize word.
> -----------------------------------
>                 Key: LUCENE-4253
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: Realtime Branch
>         Environment: Windows 7 SP1.
> Java 1.7.0-b147
>            Reporter: Nattapong Sirilappanich
> Method
> protected TokenStreamComponents createComponents(String, Reader)
> returns a component that is unable to tokenize Thai words.
> The current return statement is:
> return new TokenStreamComponents(source, new StopFilter(matchVersion, result,
> In my experiment I changed the return statement to:
> return new TokenStreamComponents(source, result);
> This gives me the correct result.

This message is automatically generated by JIRA.