lucene-java-user mailing list archives

From Robert Muir <>
Subject Re: Can I omit ShingleFilter's filler tokens
Date Thu, 12 May 2011 17:15:19 GMT
On Thu, May 12, 2011 at 1:03 PM, Steven A Rowe <> wrote:
> A thought: one way to do #1 without modifying ShingleFilter: if there were a StopFilter
> variant that accepted regular expressions instead of a stopword list, you could configure
> it with a regex like /_ .*|.* _| _ / (assuming a full match is required, i.e. implicit beginning
> and end anchors), and place it in the analysis pipeline after ShingleFilter to throw out shingles
> with filler tokens in them.
>
> (I think it would be useful to generalize StopFilter to allow for more sources of stoppage,
> rather than just creating a StopRegexFilter with no relation to StopFilter.)

we already did this in 3.1 by making a base FilteringTokenFilter class?
a regex filter is trivial if you subclass this (we could add something
like this untested code to the .pattern package or whatever)

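import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.lucene.analysis.FilteringTokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class PatternRemoveFilter extends FilteringTokenFilter {
  private final Matcher matcher;
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public PatternRemoveFilter(boolean enablePositionIncrements,
      TokenStream input, Pattern pattern) {
    super(enablePositionIncrements, input);
    matcher = pattern.matcher(termAtt);
  }

  @Override
  protected boolean accept() throws IOException {
    // reset() so the matcher sees the current term's length before a full match
    return !matcher.reset().matches();
  }
}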

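For anyone who wants to check the regex from the quoted message before wiring it into a filter, here is a minimal sketch using only java.util.regex, with no Lucene involved. The class name, method name, and sample shingles are made up for illustration. One assumption to flag: under full-match semantics the third alternative as quoted (" _ ") would only match that literal three-character string, so the sketch assumes the intent was ".* _ .*" for a filler token in the middle of a shingle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Lucene.
class FillerShingleDemo {
  // Assumed pattern: filler "_" as the first, last, or an interior token.
  static final Pattern FILLER = Pattern.compile("_ .*|.* _|.* _ .*");

  // Keeps only the shingles that do not fully match the filler pattern.
  static List<String> dropFillerShingles(List<String> shingles) {
    Matcher matcher = FILLER.matcher("");
    List<String> kept = new ArrayList<String>();
    for (String shingle : shingles) {
      // matches() requires the whole shingle to match, i.e. implicit anchors
      if (!matcher.reset(shingle).matches()) {
        kept.add(shingle);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<String> shingles = Arrays.asList(
        "please divide", "divide _", "_ sentence",
        "this _ sentence", "under_score word");
    // prints [please divide, under_score word]
    System.out.println(dropFillerShingles(shingles));
  }
}
```

Note that "under_score word" survives because the underscore is not a standalone token surrounded by spaces. This assumes the shingle token separator is a single space and the filler token is "_".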