lucene-java-user mailing list archives

From "Jack Krupansky"
Subject Re: Using stop words with snowball analyzer and shingle filter
Date Wed, 19 Sep 2012 23:15:53 GMT
The underscores appear because the StopFilter defaults to "enable position 
increments", so there are no terms at the positions where the stop words 
appeared in the source text, and the shingle filter fills those empty 
positions with its "_" filler token.

Unfortunately, SnowballAnalyzer does not accept that setting as a parameter, 
and the class is "final", so you can't subclass it to override the 
"createComponents" method that creates the StopFilter. You would essentially 
have to copy the source for SnowballAnalyzer and add code that calls 
StopFilter.setEnablePositionIncrements, the way StopFilterFactory does.
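
To illustrate the effect of those position gaps, here is a minimal, 
self-contained sketch in plain Java. It does not use Lucene at all: the class 
ShingleGapDemo, its method names and the three-word stop list are all 
assumptions made for illustration (the stop list is inferred from the output 
quoted below), and the logic is only a simplified model of what StopFilter 
and the shingle filter do, not their actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified model, NOT Lucene code: mimics how removing stop words can leave
// position gaps and how shingling pads those gaps with a filler token.
class ShingleGapDemo {
    static final String FILLER = "_";

    // Drops stop words. With keepGaps == true each removed word leaves a filler
    // slot (the "position increments enabled" case); with false the surviving
    // tokens close up. Simplification: gaps at the very start or end of the
    // text are padded too, which real Lucene versions may handle differently.
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords,
                                        boolean keepGaps) {
        List<String> slots = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                slots.add(token);
            } else if (keepGaps) {
                slots.add(FILLER);
            }
        }
        return slots;
    }

    // Counts every n-gram of 1..maxSize consecutive slots, unigrams included.
    static Map<String, Integer> shingleCounts(List<String> slots, int maxSize) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int start = 0; start < slots.size(); start++) {
            for (int n = 1; n <= maxSize && start + n <= slots.size(); n++) {
                String key = String.join(" ", slots.subList(start, start + n));
                counts.merge(key, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "satellite", "is", "definitely", "falling", "to", "earth");
        // Assumed stop list, inferred from the output in this thread.
        Set<String> stopWords = new HashSet<>(Arrays.asList("is", "definitely", "to"));

        // Gaps kept: keys contain "_" wherever a stop word was removed.
        System.out.println(shingleCounts(removeStopWords(tokens, stopWords, true), 3));
        // Gaps closed: only shingles built from the surviving words.
        System.out.println(shingleCounts(removeStopWords(tokens, stopWords, false), 3));
    }
}
```

With gaps kept, the model yields the same 13 keys as the second listing in 
the original message, including "_" with a frequency of 3. With gaps closed, 
it yields only "satellite", "satellite falling", "satellite falling earth", 
"falling", "falling earth" and "earth".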

-- Jack Krupansky

-----Original Message----- 
From: Martin O'Shea
Sent: Wednesday, September 19, 2012 4:24 AM
Subject: Using stop words with snowball analyzer and shingle filter

I'm currently giving the user an option to include stop words or not when
filtering a body of text for n-gram frequencies. Typically, this is done as:

snowballAnalyzer = new SnowballAnalyzer(Version.LUCENE_30, "English",

shingleAnalyzer = new ShingleAnalyzerWrapper(snowballAnalyzer,

stopWords is set to a list of words either to include in the n-grams or to
remove from them. this.getnGramLength() simply returns the current n-gram
length, up to a maximum of three.

If I use stopwords in filtering text "satellite is definitely falling to
Earth" for trigrams, the output is:

No=1, Key=to, Freq=1
No=2, Key=definitely, Freq=1
No=3, Key=falling to earth, Freq=1
No=4, Key=satellite, Freq=1
No=5, Key=is, Freq=1
No=6, Key=definitely falling to, Freq=1
No=7, Key=definitely falling, Freq=1
No=8, Key=falling, Freq=1
No=9, Key=to earth, Freq=1
No=10, Key=satellite is, Freq=1
No=11, Key=is definitely, Freq=1
No=12, Key=falling to, Freq=1
No=13, Key=is definitely falling, Freq=1
No=14, Key=earth, Freq=1
No=15, Key=satellite is definitely, Freq=1

But if I don't use stopwords for trigrams, the output is this:

No=1, Key=satellite, Freq=1
No=2, Key=falling _, Freq=1
No=3, Key=satellite _ _, Freq=1
No=4, Key=_ earth, Freq=1
No=5, Key=falling, Freq=1
No=6, Key=satellite _, Freq=1
No=7, Key=_ _, Freq=1
No=8, Key=_ falling _, Freq=1
No=9, Key=falling _ earth, Freq=1
No=10, Key=_, Freq=3
No=11, Key=earth, Freq=1
No=12, Key=_ _ falling, Freq=1
No=13, Key=_ falling, Freq=1

Why am I seeing underscores? I would have expected to see just the simple
unigrams, plus "satellite falling", "falling earth" and "satellite falling
earth".
