lucene-java-user mailing list archives

From "Martin O'Shea" <app...@dsl.pipex.com>
Subject RE: Using stop words with snowball analyzer and shingle filter
Date Thu, 20 Sep 2012 09:21:51 GMT
Thanks for the responses. They've given me much food for thought.

-----Original Message-----
From: Steven A Rowe [mailto:sarowe@syr.edu] 
Sent: 20 Sep 2012 02:19
To: java-user@lucene.apache.org
Subject: RE: Using stop words with snowball analyzer and shingle filter

Hi Martin,

SnowballAnalyzer was deprecated in Lucene 3.0.3 and will be removed in
Lucene 5.0.

Looks like you're using Lucene 3.X; here's an (untested) Analyzer, based on
Lucene 3.6 EnglishAnalyzer, (except substituting SnowballFilter for
PorterStemmer; disabling stopword holes' position increments; and adding
ShingleFilter), that should basically do what you want:

------
final Version matchVersion = Version.LUCENE_3X;  // placeholder: use your 3.x constant, e.g. Version.LUCENE_36
final String[] stopWords = new String[] { ... };
final Set<?> stopSet = StopFilter.makeStopSet(matchVersion, stopWords);
final String[] stemExclusions = new String[] { ... };
final Set<String> stemExclusionsSet = new HashSet<String>(Arrays.asList(stemExclusions));

Analyzer analyzer = new ReusableAnalyzerBase() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    final Tokenizer source = new StandardTokenizer(matchVersion, reader);
    TokenStream result = new StandardFilter(matchVersion, source);
    // prior to this we get the classic behavior, standardfilter does it for us.
    if (matchVersion.onOrAfter(Version.LUCENE_31))
      result = new EnglishPossessiveFilter(matchVersion, result);
    result = new LowerCaseFilter(matchVersion, result);
    result = new StopFilter(matchVersion, result, stopSet);
    // Disable holes' position increments
    ((StopFilter) result).setEnablePositionIncrements(false);
    if (stemExclusionsSet.size() > 0) {
      result = new KeywordMarkerFilter(result, stemExclusionsSet);
    }
    result = new SnowballFilter(result, "English");
    // Unqualified call: inside the anonymous class, "this" is the Analyzer, not the enclosing object
    result = new ShingleFilter(result, getnGramLength());
    return new TokenStreamComponents(source, result);
  }
};
------

Steve

-----Original Message-----
From: Jack Krupansky [mailto:jack@basetechnology.com]
Sent: Wednesday, September 19, 2012 7:16 PM
To: java-user@lucene.apache.org
Subject: Re: Using stop words with snowball analyzer and shingle filter

The underscores are due to the fact that the StopFilter defaults to "enable
position increments", so there are no terms at the positions where the stop
words appeared in the source text.

Unfortunately, SnowballAnalyzer does not pass that in as a parameter and is
"final" so you can't subclass it to override the "createComponents" method
that creates the StopFilter, so you would essentially have to copy the
source for SnowballAnalyzer and then add in the code to invoke
StopFilter.setEnablePositionIncrements the way StopFilterFactory does.
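
The effect described above can be reproduced without Lucene. What follows
is a minimal plain-Java sketch, not Lucene's implementation: the class
ShingleHoleDemo, its shingles method, and the keepHoles flag are all
hypothetical names invented for illustration. It assumes whitespace
tokenization, no stemming, and a "_" filler like the one in the output
quoted in this thread. With keepHoles set to true, each removed stop word
leaves a filler token behind, as happens when position increments are
enabled; with false, the surviving terms become adjacent.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class ShingleHoleDemo {
    // Assumed filler, chosen to match the "_" seen in the output in this thread.
    static final String FILLER = "_";

    // Returns all n-grams of 1..maxSize tokens, in order of starting position.
    // keepHoles = true  : a removed stop word leaves a filler token in its place.
    // keepHoles = false : a removed stop word leaves nothing behind.
    public static List<String> shingles(String text, Set<String> stopWords,
                                        int maxSize, boolean keepHoles) {
        List<String> tokens = new ArrayList<String>();
        for (String raw : text.trim().split("\\s+")) {
            String term = raw.toLowerCase(Locale.ROOT);
            if (term.isEmpty()) {
                continue;                 // empty input yields no tokens
            }
            if (!stopWords.contains(term)) {
                tokens.add(term);         // ordinary term, kept as-is
            } else if (keepHoles) {
                tokens.add(FILLER);       // stop word removed, hole preserved
            }
        }

        List<String> out = new ArrayList<String>();
        for (int start = 0; start < tokens.size(); start++) {
            StringBuilder gram = new StringBuilder();
            // Grow the n-gram one token at a time, emitting each length.
            for (int size = 1; size <= maxSize && start + size <= tokens.size(); size++) {
                if (size > 1) {
                    gram.append(' ');
                }
                gram.append(tokens.get(start + size - 1));
                out.add(gram.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<String>(Arrays.asList("is", "definitely", "to"));
        String text = "satellite is definitely falling to Earth";
        System.out.println(shingles(text, stop, 3, true));   // holes kept
        System.out.println(shingles(text, stop, 3, false));  // holes dropped
    }
}
```

In this sketch the stop list is assumed to contain "is", "definitely" and
"to", since those are the words replaced by underscores in the original
question. Dropping the holes is the analogue of calling
setEnablePositionIncrements(false) on the StopFilter.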

-- Jack Krupansky

-----Original Message-----
From: Martin O'Shea
Sent: Wednesday, September 19, 2012 4:24 AM
To: java-user@lucene.apache.org
Subject: Using stop words with snowball analyzer and shingle filter

I'm currently giving the user an option to include stop words or not when
filtering a body of text for ngram frequencies. Typically, this is done as
follows:

snowballAnalyzer = new SnowballAnalyzer(Version.LUCENE_30, "English", stopWords);
shingleAnalyzer = new ShingleAnalyzerWrapper(snowballAnalyzer, this.getnGramLength());

stopWords holds the list of words either to include in the ngrams or to
remove from them, depending on the user's choice. this.getnGramLength()
simply returns the current ngram length, up to a maximum of three.

If I keep stop words when filtering the text "satellite is definitely
falling to Earth" for trigrams, the output is:

No=1, Key=to, Freq=1
No=2, Key=definitely, Freq=1
No=3, Key=falling to earth, Freq=1
No=4, Key=satellite, Freq=1
No=5, Key=is, Freq=1
No=6, Key=definitely falling to, Freq=1
No=7, Key=definitely falling, Freq=1
No=8, Key=falling, Freq=1
No=9, Key=to earth, Freq=1
No=10, Key=satellite is, Freq=1
No=11, Key=is definitely, Freq=1
No=12, Key=falling to, Freq=1
No=13, Key=is definitely falling, Freq=1
No=14, Key=earth, Freq=1
No=15, Key=satellite is definitely, Freq=1

But if I remove stop words for trigrams, the output is this:

No=1, Key=satellite, Freq=1
No=2, Key=falling _, Freq=1
No=3, Key=satellite _ _, Freq=1
No=4, Key=_ earth, Freq=1
No=5, Key=falling, Freq=1
No=6, Key=satellite _, Freq=1
No=7, Key=_ _, Freq=1
No=8, Key=_ falling _, Freq=1
No=9, Key=falling _ earth, Freq=1
No=10, Key=_, Freq=3
No=11, Key=earth, Freq=1
No=12, Key=_ _ falling, Freq=1
No=13, Key=_ falling, Freq=1

Why am I seeing underscores? I would have expected to see just the
unigrams, plus "satellite falling", "falling earth", and "satellite
falling earth".

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org