lucene-java-user mailing list archives

From: Steven A Rowe
Subject: RE: Can I omit ShingleFilter's filler tokens
Date: Wed, 11 May 2011 12:27:40 GMT

Hi Bill,

I can think of two possible interpretations of "removing filler tokens":

1. Don't create shingles across stopwords, e.g. for text "one two three four five" and stopword
"three", bigrams only, you'd get ("one two", "four five"), instead of the current ("one two",
"two _", "_ four", "four five").

2. Create shingles as if the stopwords were never there, e.g. for the same text and stopword,
bigrams only, you'd get ("one two", "two four", "four five").

Which one did you have in mind?  #2 can be achieved by adding PositionFilter after StopFilter
and before ShingleFilter.  I think #1 requires ShingleFilter modifications.
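
To make the difference between the two interpretations concrete, here is a minimal plain-Java sketch that reproduces the bigram lists above. It does not use Lucene at all: the class ShingleSketch and its methods are hypothetical names invented for illustration, and it only models the simple case of isolated stopwords with bigrams, not ShingleFilter's full behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; this is not Lucene code.
class ShingleSketch {
    static final String FILLER = "_";

    // Pair up adjacent slots; optionally skip any pair that touches a filler.
    static List<String> bigrams(List<String> slots, boolean skipFillers) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < slots.size(); i++) {
            String a = slots.get(i);
            String b = slots.get(i + 1);
            if (skipFillers && (a.equals(FILLER) || b.equals(FILLER))) {
                continue;
            }
            out.add(a + " " + b);
        }
        return out;
    }

    // Replace each stopword with the filler, keeping its position.
    static List<String> markStopwords(List<String> tokens, Set<String> stop) {
        List<String> slots = new ArrayList<>();
        for (String t : tokens) {
            slots.add(stop.contains(t) ? FILLER : t);
        }
        return slots;
    }

    // Current output: the stopword's position is filled with "_".
    static List<String> withFillers(List<String> tokens, Set<String> stop) {
        return bigrams(markStopwords(tokens, stop), false);
    }

    // Interpretation #1: no shingles across a stopword.
    static List<String> noCrossing(List<String> tokens, Set<String> stop) {
        return bigrams(markStopwords(tokens, stop), true);
    }

    // Interpretation #2: shingle as if the stopwords were never there.
    static List<String> closedGaps(List<String> tokens, Set<String> stop) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!stop.contains(t)) {
                kept.add(t);
            }
        }
        return bigrams(kept, false);
    }
}
```

For the tokens "one two three four five" with stopword "three", withFillers gives ("one two", "two _", "_ four", "four five"), noCrossing gives ("one two", "four five"), and closedGaps gives ("one two", "two four", "four five").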


> -----Original Message-----
> From: William Koscho
> Sent: Wednesday, May 11, 2011 12:05 AM
> To:
> Subject: Can I omit ShingleFilter's filler tokens
> Hi,
> Can I remove the filler token _ from the n-gram-tokens that are generated
> by a ShingleFilter?
> I'm using a chain of filters: ClassicFilter, StopFilter, LowerCaseFilter,
> and ShingleFilter to create phrase n-grams.  The ShingleFilter inserts
> FILLER_TOKENs in place of the stopwords, but I don't want them.
> How can I omit the filler tokens?
> thanks
> Bill