lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Robert Muir <rcm...@gmail.com>
Subject Re: Can I omit ShingleFilter's filler tokens
Date Wed, 11 May 2011 13:01:32 GMT
another idea is to .setEnablePositionIncrements(false) on your stopfilter.

On Wed, May 11, 2011 at 8:27 AM, Steven A Rowe <sarowe@syr.edu> wrote:
> Hi Bill,
>
> I can think of two possible interpretations of "removing filler tokens":
>
> 1. Don't create shingles across stopwords, e.g. for text "one two three four five" and
stopword "three", bigrams only, you'd get ("one two", "four five"), instead of the current
("one two", "two _", "_ four", "four five").
>
> 2. Create shingles as if the stopwords were never there, e.g. for the same text and stopword,
bigrams only, you'd get ("one two", "two four", "four five").
>
> Which one did you have in mind?  #2 can be achieved by adding PositionFilter after StopFilter
and before ShingleFilter.  I think #1 requires ShingleFilter modifications.
>
> Steve
>
>> -----Original Message-----
>> From: William Koscho [mailto:wkoscho@gmail.com]
>> Sent: Wednesday, May 11, 2011 12:05 AM
>> To: java-user@lucene.apache.org
>> Subject: Can I omit ShingleFilter's filler tokens
>>
>> Hi,
>>
>> Can I remove the filler token _ from the n-gram-tokens that are generated
>> by
>> a ShingleFilter?
>>
>> I'm using a chain of filters: ClassicFilter, StopFilter, LowerCaseFilter,
>> and ShingleFilter to create phrase n-grams.  The ShingleFilter inserts
>> FILLER_TOKENs in place of the stopwords, but I don't want them.
>>
>> How can I omit the filler tokens?
>>
>> thanks
>> Bill
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message