lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] [Commented] (LUCENE-3366) StandardFilter only works with ClassicTokenizer and only when version < 3.1
Date Tue, 09 Aug 2011 02:18:27 GMT


Robert Muir commented on LUCENE-3366:

The purpose of the filter is "Normalizes tokens extracted with StandardTokenizer".

Currently this is a no-op, but we can always improve it in the spirit of the
standard that it implements.

The TODO currently refers to this statement:
"For Thai, Lao, Khmer, Myanmar, and other scripts that do not typically use spaces between
words, a good implementation should not depend on the default word boundary specification.
It should use a more sophisticated mechanism ... Ideographic scripts such as Japanese and
Chinese are even more complex"

There is no problem with having a TODO in this filter; we don't need to rush this for any reason.

Some of the preparation for this (e.g. improving the default behavior for CJK) was already
done in LUCENE-2911. We now tag all these special types, so in the meantime anyone who wants
to do their own downstream processing can do so.
> StandardFilter only works with ClassicTokenizer and only when version < 3.1
> ---------------------------------------------------------------------------
>                 Key: LUCENE-3366
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 3.3
>            Reporter: David Smiley
> The StandardFilter used to remove periods from acronyms and trailing apostrophe-S ('s)
> where they occurred. And it used to work in conjunction with the StandardTokenizer.
> Presently, it only does this with ClassicTokenizer and when the lucene match version is
> before 3.1. Here is an excerpt from the code:
> {code:lang=java}
>   public final boolean incrementToken() throws IOException {
>     if (matchVersion.onOrAfter(Version.LUCENE_31))
>       return input.incrementToken(); // TODO: add some niceties for the new grammar
>     else
>       return incrementTokenClassic();
>   }
> {code}
> It seems to me that in the great refactor of the standard tokenizer, LUCENE-2167, something
> was forgotten here. I think that if someone uses the ClassicTokenizer then no matter what
> the version is, this filter should do what it used to do. And the TODO suggests someone forgot
> to make this filter do something useful for the StandardTokenizer.  Or perhaps that idea should
> be discarded and this class should be named ClassicTokenFilter.
> In any event, the javadocs for this class appear out of date as there is no mention of
> ClassicTokenizer, and the wiki is out of date too.
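
For readers unfamiliar with the old behavior, the normalization the description refers to can be sketched roughly as follows, keyed on the token's type tag rather than on the match version. This is a standalone illustration, not the Lucene source: `ClassicNormalization` and `normalize` are hypothetical names, and the `<ACRONYM>` and `<APOSTROPHE>` tag strings are assumptions about what ClassicTokenizer emits.

```java
// Standalone sketch of the classic normalization; not the Lucene class.
final class ClassicNormalization {
    // Assumed type tags for the two token kinds the old filter rewrote.
    static final String APOSTROPHE = "<APOSTROPHE>";
    static final String ACRONYM = "<ACRONYM>";

    // Returns the normalized term: strips a trailing 's or 'S from
    // apostrophe tokens, removes periods from acronym tokens, and
    // leaves every other token type unchanged.
    static String normalize(String term, String type) {
        if (APOSTROPHE.equals(type)) {
            if (term.endsWith("'s") || term.endsWith("'S")) {
                return term.substring(0, term.length() - 2);
            }
            return term;
        }
        if (ACRONYM.equals(type)) {
            return term.replace(".", "");
        }
        return term;
    }
}
```

Because the decision depends only on the type tag, tokens from a tokenizer that never emits these tags pass through untouched, which is roughly the behavior the description argues for when ClassicTokenizer is in use.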

This message is automatically generated by JIRA.