lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: Can I omit ShingleFilter's filler tokens
Date Thu, 12 May 2011 17:15:19 GMT
On Thu, May 12, 2011 at 1:03 PM, Steven A Rowe <sarowe@syr.edu> wrote:
> A thought: one way to do #1 without modifying ShingleFilter: if there were a StopFilter
> variant that accepted regular expressions instead of a stopword list, you could configure
> it with a regex like /_ .*|.* _| _ / (assuming a full match is required, i.e. implicit beginning
> and end anchors), and place it in the analysis pipeline after ShingleFilter to throw out shingles
> with filler tokens in them.
>
> (I think it would be useful to generalize StopFilter to allow for more sources of stoppage,
> rather than just creating a StopRegexFilter with no relation to StopFilter.)
>

We already did this in 3.1 by adding a base FilteringTokenFilter class.
A regex filter is trivial if you subclass it (we could add something
like this untested code to the .pattern package or wherever):

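import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.FilteringTokenFilter; // package as of Lucene 3.1
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/** Removes every token whose entire term text matches the given pattern. */
public class PatternRemoveFilter extends FilteringTokenFilter {
  private final Matcher matcher;
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public PatternRemoveFilter(boolean enablePositionIncrements, TokenStream input, Pattern pattern) {
    super(enablePositionIncrements, input);
    // CharTermAttribute is a CharSequence, so one Matcher can be bound to it and reused.
    matcher = pattern.matcher(termAtt);
  }

  @Override
  protected boolean accept() throws IOException {
    // reset() makes the matcher re-read the current term; drop the token on a full match.
    matcher.reset();
    return !matcher.matches();
  }
}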
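
To see what the quoted regex does under the full-match semantics that accept() uses, here is a small standalone sketch with plain java.util.regex and no Lucene dependency. It assumes ShingleFilter's default filler token "_" and a single space as the token separator. The class name, the sample shingles, and the WIDER pattern are made up for illustration and are not part of the original suggestion. One thing it shows: with matches(), the " _ " alternative only matches the exact string " _ ", so a shingle with the filler in the middle gets through unless that alternative is widened.

```java
import java.util.regex.Pattern;

public class FillerShingleRegexDemo {
    // The regex from the quoted message, written as a Java pattern.
    static final Pattern QUOTED = Pattern.compile("_ .*|.* _| _ ");

    // Hypothetical variant: the last alternative is widened so that a filler
    // in the middle of a shingle also produces a full match.
    static final Pattern WIDER = Pattern.compile("_ .*|.* _|.* _ .*");

    // Same decision as PatternRemoveFilter.accept(): keep the term unless the
    // whole term matches the pattern.
    static boolean accept(Pattern pattern, CharSequence term) {
        return !pattern.matcher(term).matches();
    }

    public static void main(String[] args) {
        // Sample shingles, assuming "_" as filler and " " as separator.
        String[] shingles = {
            "please divide",      // no filler
            "_ divide",           // leading filler
            "divide _",           // trailing filler
            "divide _ sentence"   // filler in the middle
        };
        for (String shingle : shingles) {
            System.out.println("[" + shingle + "]"
                + " quoted keeps=" + accept(QUOTED, shingle)
                + " wider keeps=" + accept(WIDER, shingle));
        }
    }
}
```

Either pattern could be passed to a filter like the one above; which one is right depends on the shingle sizes in use and on whether shingles with a filler in the middle should be dropped too.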
