lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erik Hatcher <>
Subject Re: Implicit Stopping in StandardTokenizer??
Date Mon, 20 Jun 2005 18:53:44 GMT

On Jun 20, 2005, at 11:48 AM, Max Pfingsthorn wrote:
> import;
> import org.apache.lucene.analysis.standard.StandardTokenizer;
> import org.apache.lucene.analysis.standard.StandardFilter;
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.LowerCaseFilter;
> import org.apache.lucene.analysis.StopAnalyzer;
> import org.apache.lucene.analysis.StopFilter;
> import org.apache.lucene.analysis.TokenStream;
> public class SimpleStandardAnalyzer extends Analyzer {
>   public SimpleStandardAnalyzer() {
>   }
>   public TokenStream tokenStream(String fieldName, Reader reader) {
>     return new LowerCaseFilter(
>       new StandardFilter(
>         new StandardTokenizer(reader)
>       )
>     );
>   }
> }

That looks fine.  Stop words will not (well, "should not" it appears  
to you!) be removed from that analyzer.

> I tried this too, but still the same effect. "This" and "is", etc,  
> get filtered out even with no stopwords set. Any ideas?

The only idea I have is that you're running StandardAnalyzer and not  
this custom one by something amiss in your indirect configuration.

For example, I typed in "this is" to the AnalyzerDemo that ships with  
the Lucene in Action source code (get it from http:// after modifying AnalyzerDemo to construct a  
StandardAnalyzer like this:

     new StandardAnalyzer(new String[] {})

And got this output:

$ ant AnalyzerDemo

     [input] String to analyze: [This string will be analyzed.]
this is
      [echo] Running lia.analysis.AnalyzerDemo...
      [java] Analyzing "this is"
      [java]   WhitespaceAnalyzer:
      [java]     [this] [is]

      [java]   SimpleAnalyzer:
      [java]     [this] [is]

      [java]   StopAnalyzer:

      [java]   StandardAnalyzer:
      [java]     [this] [is]

      [java]   SnowballAnalyzer:
      [java]     [this] [is]

      [java]   SnowballAnalyzer:
      [java]     [this] [is]

      [java]   SnowballAnalyzer:
      [java]     [thi] [i]

As you can see [this] and [is] made it fine through StandardAnalyzer.


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message