lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Max Pfingsthorn" <m.pfingsth...@hippo.nl>
Subject RE: Implicit Stopping in StandardTokenizer??
Date Mon, 20 Jun 2005 15:48:38 GMT
Hi!

Here is the code:


import java.io.Reader;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;

public class SimpleStandardAnalyzer extends Analyzer {
  public SimpleStandardAnalyzer() {
  }

  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new LowerCaseFilter(
      new StandardFilter(
        new StandardTokenizer(reader)
	  )
	);
  }
}

I need it to be easily dynamically loaded via Class.forName() because we use it in a xml-configured
environment (i.e. Avalon-like). So, passing extra stuff to constructors is not really what
I am looking for. However, I guess I could make a wrapper like this:

public class SimpleStandardAnalyzer extends StandardAnalyzer {

  public SimpleStandardAnalyzer()
  {
    super(new String[0]);
  }
}

I tried this too, but still the same effect. "This" and "is", etc, get filtered out even with
no stopwords set. Any ideas?

Thanks a lot!
max


-----Original Message-----
From: Erik Hatcher [mailto:erik@ehatchersolutions.com]
Sent: Monday, June 20, 2005 16:57
To: java-user@lucene.apache.org
Subject: Re: Implicit Stopping in StandardTokenizer??



On Jun 20, 2005, at 10:41 AM, Max Pfingsthorn wrote:

> Hi!
>
> I've been trying to make an Analyzer which works like the  
> StandardAnalyzer but without stopping. For some reason though, I  
> still don't get words like "is" or "a" out of it... I checked with  
> Luke (one doc in one index with the contents  
> "hello,this,is,a,keyword,hello!,nicetomeetyou". This should  
> tokenize into "hello this is a keyword hello nicetomeetyou", but  
> actually it does "hello keyword hello nicetomeetyou". Does anyone  
> know why it drops those extra terms?

Show us the code of your analyzer.

If all you want is StandardAnalyzer to not remove stop words, you can  
construct it this way:

     analyzer = new StandardAnalyzer(new String[] {});

The String[] are the stop words to remove, in this case none.

     Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message