lucene-java-user mailing list archives

From Steven A Rowe <sar...@syr.edu>
Subject RE: Using stop words with snowball analyzer and shingle filter
Date Thu, 20 Sep 2012 01:19:13 GMT
Hi Martin,

SnowballAnalyzer was deprecated in Lucene 3.0.3 and will be removed in Lucene 5.0.

Looks like you're using Lucene 3.X; here's an (untested) Analyzer, based on Lucene 3.6 EnglishAnalyzer,
that should basically do what you want. It differs from EnglishAnalyzer in three ways: it substitutes
SnowballFilter for PorterStemFilter, disables position increments for stopword holes, and adds ShingleFilter:

------
// Declared first (and final) so it can be used below and inside the anonymous class.
final Version matchVersion = Version.LUCENE_36;
String[] stopWords = new String[] { ... };
final Set<?> stopSet = StopFilter.makeStopSet(matchVersion, stopWords);
String[] stemExclusions = new String[] { ... };
final Set<String> stemExclusionsSet = new HashSet<String>(Arrays.asList(stemExclusions));
// Read the shingle size here: inside the anonymous class, "this" would be the Analyzer itself.
final int maxShingleSize = this.getnGramLength();

Analyzer analyzer = new ReusableAnalyzerBase() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    final Tokenizer source = new StandardTokenizer(matchVersion, reader);
    TokenStream result = new StandardFilter(matchVersion, source);
    // prior to this we get the classic behavior, standardfilter does it for us.
    if (matchVersion.onOrAfter(Version.LUCENE_31))
      result = new EnglishPossessiveFilter(matchVersion, result);
    result = new LowerCaseFilter(matchVersion, result);
    result = new StopFilter(matchVersion, result, stopSet);
    ((StopFilter)result).setEnablePositionIncrements(false);  // Disable holes' position increments
    // Terms in the exclusion set are marked as keywords so the stemmer leaves them alone.
    if (stemExclusionsSet.size() > 0) {
      result = new KeywordMarkerFilter(result, stemExclusionsSet);
    }
    result = new SnowballFilter(result, "English");
    result = new ShingleFilter(result, maxShingleSize);
    return new TokenStreamComponents(source, result);
  }
};
------

Steve

-----Original Message-----
From: Jack Krupansky [mailto:jack@basetechnology.com] 
Sent: Wednesday, September 19, 2012 7:16 PM
To: java-user@lucene.apache.org
Subject: Re: Using stop words with snowball analyzer and shingle filter

The underscores are due to the fact that the StopFilter defaults to "enable 
position increments", so there are no terms at the positions where the stop 
words appeared in the source text.

Unfortunately, SnowballAnalyzer does not expose that setting as a parameter, 
and because the class is "final" you can't subclass it to override the 
"createComponents" method that creates the StopFilter. You would essentially 
have to copy the source for SnowballAnalyzer and then add in the code to 
invoke StopFilter.setEnablePositionIncrements the way StopFilterFactory does.
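
As a rough illustration of the effect described above, here is a small Lucene-free sketch. It is a simplified model, not Lucene's implementation: the class ShingleGapSketch and its methods are hypothetical names, the "_" filler is taken from the output in the original question, and the stop word list ("is", "definitely", "to") is an assumption inferred from that output. When removed stop words keep their positions, the shingles contain "_" fillers; when the positions are dropped, they do not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified model of stop word removal followed by shingling (no Lucene dependency).
final class ShingleGapSketch {

  /** Removes stop words; when keepPositions is true, each removed word leaves a "_" filler behind. */
  public static List<String> filterStopWords(List<String> tokens, Set<String> stopWords,
                                             boolean keepPositions) {
    List<String> out = new ArrayList<String>();
    for (String token : tokens) {
      if (!stopWords.contains(token)) {
        out.add(token);
      } else if (keepPositions) {
        out.add("_");  // stands in for the empty position left by the stop word
      }
    }
    return out;
  }

  /** Builds every run of 1..maxSize consecutive tokens, joined by single spaces. */
  public static List<String> shingles(List<String> tokens, int maxSize) {
    List<String> out = new ArrayList<String>();
    for (int start = 0; start < tokens.size(); start++) {
      StringBuilder sb = new StringBuilder();
      for (int len = 1; len <= maxSize && start + len <= tokens.size(); len++) {
        if (len > 1) {
          sb.append(' ');
        }
        sb.append(tokens.get(start + len - 1));
        out.add(sb.toString());
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // Assumed stop word list, inferred from the output in the original question.
    Set<String> stopWords = new HashSet<String>(Arrays.asList("is", "definitely", "to"));
    List<String> tokens =
        Arrays.asList("satellite", "is", "definitely", "falling", "to", "earth");

    // Positions kept: shingles such as "satellite _ _" and "_ falling" appear.
    System.out.println(shingles(filterStopWords(tokens, stopWords, true), 3));
    // Positions dropped: only shingles made of the surviving words appear.
    System.out.println(shingles(filterStopWords(tokens, stopWords, false), 3));
  }
}
```

The second call corresponds to what setEnablePositionIncrements(false) is meant to achieve: the surviving tokens become adjacent, so the shingles are built from real words only.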

-- Jack Krupansky

-----Original Message----- 
From: Martin O'Shea
Sent: Wednesday, September 19, 2012 4:24 AM
To: java-user@lucene.apache.org
Subject: Using stop words with snowball analyzer and shingle filter

I'm currently giving the user an option to include stop words or not when
filtering a body of text for ngram frequencies. Typically, this is done as
follows:



snowballAnalyzer = new SnowballAnalyzer(Version.LUCENE_30, "English",
stopWords);

shingleAnalyzer = new ShingleAnalyzerWrapper(snowballAnalyzer,
this.getnGramLength());



stopWords is set to a full list of words either to include in the ngrams or
to remove from them. this.getnGramLength() simply returns the current ngram
length, up to a maximum of three.

If I keep stop words when filtering the text "satellite is definitely
falling to Earth" for trigrams, the output is:



No=1, Key=to, Freq=1
No=2, Key=definitely, Freq=1
No=3, Key=falling to earth, Freq=1
No=4, Key=satellite, Freq=1
No=5, Key=is, Freq=1
No=6, Key=definitely falling to, Freq=1
No=7, Key=definitely falling, Freq=1
No=8, Key=falling, Freq=1
No=9, Key=to earth, Freq=1
No=10, Key=satellite is, Freq=1
No=11, Key=is definitely, Freq=1
No=12, Key=falling to, Freq=1
No=13, Key=is definitely falling, Freq=1
No=14, Key=earth, Freq=1
No=15, Key=satellite is definitely, Freq=1



But if I remove stop words for trigrams, the output is this:



No=1, Key=satellite, Freq=1
No=2, Key=falling _, Freq=1
No=3, Key=satellite _ _, Freq=1
No=4, Key=_ earth, Freq=1
No=5, Key=falling, Freq=1
No=6, Key=satellite _, Freq=1
No=7, Key=_ _, Freq=1
No=8, Key=_ falling _, Freq=1
No=9, Key=falling _ earth, Freq=1
No=10, Key=_, Freq=3
No=11, Key=earth, Freq=1
No=12, Key=_ _ falling, Freq=1
No=13, Key=_ falling, Freq=1



Why am I seeing underscores? I would have expected to see the simple unigrams
plus "satellite falling", "falling earth", and "satellite falling earth".

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

