lucene-java-user mailing list archives

From Uwe Schindler <...@thetaphi.de>
Subject Re: Lucene 5.5.0 StopFilter Error
Date Thu, 25 Feb 2016 21:53:35 GMT
You must build the whole stream, including all filters, before you consume it. So first create
the Tokenizer, then wrap it in the filter. Once all this is done, consume the filter on top using
the TokenStream workflow: reset(), incrementToken() until it returns false, end(), close(). You
don't need the Tokenizer anymore (you can drop its reference), because the filter delegates
everything down the chain. Finally, close only the filter, not the Tokenizer.
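
As a rough sketch of that order of operations, assuming Lucene 5.5.0 with the analyzers-common module on the classpath (the class name StopWordsDemo is made up for illustration; the Lucene calls are the ones already used in the quoted code, plus the no-argument StandardTokenizer constructor):

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;

// Hypothetical demo class; not part of Lucene.
class StopWordsDemo {

    static String removeStopWords(String input) throws IOException {
        // 1. Build the whole chain first. Nothing is reset at this point.
        StandardTokenizer tokenizer = new StandardTokenizer();
        tokenizer.setReader(new StringReader(input));
        CharArraySet stopWords = EnglishAnalyzer.getDefaultStopSet();

        StringBuilder sb = new StringBuilder();

        // 2. Consume only the outermost stream. try-with-resources closes the
        //    filter, and the filter closes the tokenizer it wraps.
        try (TokenStream stream = new StopFilter(tokenizer, stopWords)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);

            stream.reset();                    // the single reset(), on the filter
            while (stream.incrementToken()) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(term.toString());
            }
            stream.end();
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(removeStopWords("hello how are you? I am fine. This is a great day!"));
    }
}
```

Compared with the quoted code, the direct tokenizer.reset() and tokenizer.close() calls are gone; everything else goes through the filter.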

In your code you indirectly called reset() twice on the Tokenizer: first directly, and then
implicitly through the filter.
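
To illustrate why a second reset() is fatal, here is a toy model in plain Java. It is not Lucene's source: ToyTokenizer and ToyFilter are hypothetical stand-ins, and the hand-over of a pending input on reset() is a simplifying assumption chosen to mimic the behaviour seen in the stack trace.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

// Toy model only; none of these classes exist in Lucene.
class DoubleResetDemo {

    /** Hypothetical tokenizer: input becomes readable only after one reset(). */
    static class ToyTokenizer {
        private Deque<String> pending; // set by setInput(), handed over by reset()
        private Deque<String> active;  // null stands for "illegal state"

        void setInput(String text) {
            pending = new ArrayDeque<>(Arrays.asList(text.split("\\s+")));
        }

        void reset() {
            active = pending; // hand over the pending input
            pending = null;   // a second reset() hands over "illegal state"
        }

        String next() {
            if (active == null) {
                throw new IllegalStateException(
                    "contract violation: reset() missing or called multiple times");
            }
            return active.poll(); // null once the input is exhausted
        }
    }

    /** Hypothetical filter: forwards every call to the stream it wraps. */
    static class ToyFilter {
        private final ToyTokenizer input;

        ToyFilter(ToyTokenizer input) {
            this.input = input;
        }

        void reset() {
            input.reset(); // this is the "implicit" reset on the tokenizer
        }

        String next() {
            return input.next();
        }
    }

    static String consume(boolean resetTokenizerFirst) {
        ToyTokenizer tokenizer = new ToyTokenizer();
        tokenizer.setInput("a b c");
        ToyFilter filter = new ToyFilter(tokenizer);

        if (resetTokenizerFirst) {
            tokenizer.reset(); // the direct reset from the quoted code
        }
        filter.reset();

        StringBuilder sb = new StringBuilder();
        for (String t = filter.next(); t != null; t = filter.next()) {
            sb.append(t);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(consume(false)); // single reset through the filter
        try {
            consume(true);                  // direct reset, then reset via filter
        } catch (IllegalStateException e) {
            System.out.println("failed: " + e.getMessage());
        }
    }
}
```

In this model the first call succeeds and the second throws on the first read, because the input was already handed over by the direct reset().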

Uwe

On 25 February 2016 22:43:30 CET, Jake Clawson <clawsonjake@yahoo.com.INVALID> wrote:
>I am trying to use StopFilter in Lucene 5.5.0. I tried the following:
>
>package lucenedemo;
>
>import java.io.StringReader;
>import java.util.ArrayList;
>import java.util.Arrays;
>import java.util.Collections;
>import java.util.HashSet;
>import java.util.List;
>import java.util.Set;
>import java.util.Iterator;
>
>import org.apache.lucene.*;
>import org.apache.lucene.analysis.*;
>import org.apache.lucene.analysis.standard.*;
>import org.apache.lucene.analysis.core.StopFilter;
>import org.apache.lucene.analysis.en.EnglishAnalyzer;
>import org.apache.lucene.analysis.standard.StandardAnalyzer;
>import org.apache.lucene.analysis.standard.StandardTokenizer;
>import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>import org.apache.lucene.analysis.util.CharArraySet;
>import org.apache.lucene.util.AttributeFactory;
>import org.apache.lucene.util.Version;
>
>public class lucenedemo {
>
>public static void main(String[] args) throws Exception {
>System.out.println(removeStopWords("hello how are you? I am fine. This is a great day!"));
>
>}
>
>public static String removeStopWords(String strInput) throws Exception
>{
>AttributeFactory factory = AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY;
>StandardTokenizer tokenizer = new StandardTokenizer(factory);
>tokenizer.setReader(new StringReader(strInput));
>tokenizer.reset(); 
>CharArraySet stopWords = EnglishAnalyzer.getDefaultStopSet();
>
>TokenStream streamStop = new StopFilter(tokenizer, stopWords);
>StringBuilder sb = new StringBuilder();
>CharTermAttribute charTermAttribute =
>tokenizer.addAttribute(CharTermAttribute.class);
>streamStop.reset();
>while (streamStop.incrementToken()) {
>String term = charTermAttribute.toString();
>sb.append(term + " ");
>}
>
>streamStop.end();
>streamStop.close();
>
>tokenizer.close(); 
>
>
>return sb.toString();
>
>}
>
>}
>
>
>But it gives me the following error:
>
>Exception in thread "main" java.lang.IllegalStateException: TokenStream
>contract violation: reset()/close() call missing, reset() called
>multiple times, or subclass does not call super.reset(). Please see
>Javadocs of TokenStream class for more information about the correct
>consuming workflow.
>at org.apache.lucene.analysis.Tokenizer$1.read(Tokenizer.java:109)
>at org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:527)
>at org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:738)
>at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:159)
>at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:51)
>at lucenedemo.lucenedemo.removeStopWords(lucenedemo.java:42)
>at lucenedemo.lucenedemo.main(lucenedemo.java:27)
>
>What exactly am I doing wrong here? I have closed both the Tokenizer
>and TokenStream classes. Is there something else I am missing here?
>
>Any help would be greatly appreciated.
>
>Thanks,
>Jake Clawson
>

--
Uwe Schindler
H.-H.-Meier-Allee 63, 28213 Bremen
http://www.thetaphi.de