lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Steven A Rowe <sar...@syr.edu>
Subject RE: Can I omit ShingleFilter's filler tokens
Date Thu, 12 May 2011 17:23:20 GMT
Cool!  I had forgotten about FilteringTokenFilter.

Elmo, would you care to make a JIRA issue and a patch (based on Robert's code, and adding
some tests) to create this?  If so, this may be useful:

	http://wiki.apache.org/lucene-java/HowToContribute

Steve

> -----Original Message-----
> From: Robert Muir [mailto:rcmuir@gmail.com]
> Sent: Thursday, May 12, 2011 1:15 PM
> To: java-user@lucene.apache.org
> Subject: Re: Can I omit ShingleFilter's filler tokens
> 
> On Thu, May 12, 2011 at 1:03 PM, Steven A Rowe <sarowe@syr.edu> wrote:
> > A thought: one way to do #1 without modifying ShingleFilter: if there
> were a StopFilter variant that accepted regular expressions instead of a
> stopword list, you could configure it with a regex like /_ .*|.* _| _ /
> (assuming a full match is required, i.e. implicit beginning and end
> anchors), and place it in the analysis pipeline after ShingleFilter to
> throw out shingles with filler tokens in them.
> >
> > (It think it would be useful to generalize StopFilter to allow for more
> sources of stoppage, rather than just creating a StopRegexFilter with no
> relation to StopFilter.)
> >
> 
> we already did this in 3.1 by making a base FilteringTokenFilter class?
> a regex filter is trivial if you subclass this (we could add something
> like this untested code to the .pattern package or whatever)
> 
> public class PatternRemoveFilter extends FilteringTokenFilter {
>   private final Matcher matcher;
>   private final CharTermAttribute termAtt =
> addAttribute(CharTermAttribute.class);
> 
>   public PatternRemoveFilter(boolean enablePositionIncrements,
> TokenStream input, Pattern pattern) {
>     super(enablePositionIncrements, input);
>     matcher = pattern.matcher(termAtt);
>   }
> 
>   @Override
>   protected boolean accept() throws IOException {
>     matcher.reset();
>     return !matcher.matches();
>   }
> }
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org

Mime
View raw message