lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Steven A Rowe <sar...@syr.edu>
Subject RE: Can I omit ShingleFilter's filler tokens
Date Wed, 11 May 2011 13:16:20 GMT
Yes, StopFilter.setEnablePositionIncrements(false) will almost certainly get higher throughput
than inserting PositionFilter.  Like PositionFilter, this will buy you #2 (create shingles
as if stopwords were never there), but not #1 (don't create shingles across stopwords).

> -----Original Message-----
> From: Robert Muir [mailto:rcmuir@gmail.com]
> Sent: Wednesday, May 11, 2011 9:02 AM
> To: java-user@lucene.apache.org
> Subject: Re: Can I omit ShingleFilter's filler tokens
> 
> another idea is to .setEnablePositionIncrements(false) on your
> stopfilter.
> 
> On Wed, May 11, 2011 at 8:27 AM, Steven A Rowe <sarowe@syr.edu> wrote:
> > Hi Bill,
> >
> > I can think of two possible interpretations of "removing filler
> tokens":
> >
> > 1. Don't create shingles across stopwords, e.g. for text "one two three
> four five" and stopword "three", bigrams only, you'd get ("one two",
> "four five"), instead of the current ("one two", "two _", "_ four", "four
> five").
> >
> > 2. Create shingles as if the stopwords were never there, e.g. for the
> same text and stopword, bigrams only, you'd get ("one two", "two four",
> "four five").
> >
> > Which one did you have in mind?  #2 can be achieved by adding
> PositionFilter after StopFilter and before ShingleFilter.  I think #1
> requires ShingleFilter modifications.
> >
> > Steve
> >
> >> -----Original Message-----
> >> From: William Koscho [mailto:wkoscho@gmail.com]
> >> Sent: Wednesday, May 11, 2011 12:05 AM
> >> To: java-user@lucene.apache.org
> >> Subject: Can I omit ShingleFilter's filler tokens
> >>
> >> Hi,
> >>
> >> Can I remove the filler token _ from the n-gram-tokens that are
> generated
> >> by
> >> a ShingleFilter?
> >>
> >> I'm using a chain of filters: ClassicFilter, StopFilter,
> LowerCaseFilter,
> >> and ShingleFilter to create phrase n-grams.  The ShingleFilter inserts
> >> FILLER_TOKENs in place of the stopwords, but I don't want them.
> >>
> >> How can I omit the filler tokens?
> >>
> >> thanks
> >> Bill
> >
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org

Mime
View raw message