lucene-java-user mailing list archives

From William Koscho <wkos...@gmail.com>
Subject Re: Can I omit ShingleFilter's filler tokens
Date Wed, 11 May 2011 16:10:19 GMT
I meant I'm trying for #2, so this should work (got my numbers mixed
up). Thanks again.

Bill

On 5/11/11, William Koscho <wkoscho@gmail.com> wrote:
> #1 is what I'm trying for, so I'll give setEnablePositionIncrements(false) a
> try. Thanks for everyone's help.
>
> Bill
>
> On 5/11/11, Steven A Rowe <sarowe@syr.edu> wrote:
>> Yes, StopFilter.setEnablePositionIncrements(false) will almost certainly get
>> higher throughput than inserting PositionFilter.  Like PositionFilter, this
>> will buy you #2 (create shingles as if stopwords were never there), but not
>> #1 (don't create shingles across stopwords).
>>
>>> -----Original Message-----
>>> From: Robert Muir [mailto:rcmuir@gmail.com]
>>> Sent: Wednesday, May 11, 2011 9:02 AM
>>> To: java-user@lucene.apache.org
>>> Subject: Re: Can I omit ShingleFilter's filler tokens
>>>
>>> another idea is to .setEnablePositionIncrements(false) on your
>>> stopfilter.
>>>
>>> On Wed, May 11, 2011 at 8:27 AM, Steven A Rowe <sarowe@syr.edu> wrote:
>>> > Hi Bill,
>>> >
>>> > I can think of two possible interpretations of "removing filler
>>> > tokens":
>>> >
>>> > 1. Don't create shingles across stopwords, e.g. for text "one two three
>>> > four five" and stopword "three", bigrams only, you'd get ("one two",
>>> > "four five"), instead of the current ("one two", "two _", "_ four",
>>> > "four five").
>>> >
>>> > 2. Create shingles as if the stopwords were never there, e.g. for the
>>> > same text and stopword, bigrams only, you'd get ("one two", "two four",
>>> > "four five").
>>> >
>>> > Which one did you have in mind?  #2 can be achieved by adding
>>> > PositionFilter after StopFilter and before ShingleFilter.  I think #1
>>> > requires ShingleFilter modifications.
>>> >
>>> > Steve
>>> >
>>> >> -----Original Message-----
>>> >> From: William Koscho [mailto:wkoscho@gmail.com]
>>> >> Sent: Wednesday, May 11, 2011 12:05 AM
>>> >> To: java-user@lucene.apache.org
>>> >> Subject: Can I omit ShingleFilter's filler tokens
>>> >>
>>> >> Hi,
>>> >>
>>> >> Can I remove the filler token _ from the n-gram-tokens that are
>>> >> generated by a ShingleFilter?
>>> >>
>>> >> I'm using a chain of filters: ClassicFilter, StopFilter,
>>> >> LowerCaseFilter, and ShingleFilter to create phrase n-grams.  The
>>> >> ShingleFilter inserts FILLER_TOKENs in place of the stopwords, but I
>>> >> don't want them.
>>> >>
>>> >> How can I omit the filler tokens?
>>> >>
>>> >> thanks
>>> >> Bill
>>> >
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
> --
> Sent from my mobile device
>

-- 
Sent from my mobile device
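
For readers following the thread, the three bigram outputs Steve describes
(the current filler behavior, interpretation #1, and interpretation #2) can be
reproduced with a small plain-Java sketch. This is only an illustration of the
expected token output for the "one two three four five" example; it does not
call Lucene and does not claim to mirror how ShingleFilter, StopFilter, or
PositionFilter are implemented. The class name ShingleSketch and its method
names are hypothetical, chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Illustration only: plain Java, no Lucene classes involved. */
class ShingleSketch {
    static final String FILLER = "_";

    /** Joins each pair of adjacent entries with a single space. */
    static List<String> bigrams(List<String> slots) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < slots.size(); i++) {
            out.add(slots.get(i) + " " + slots.get(i + 1));
        }
        return out;
    }

    /** Behavior reported in the thread: a removed stopword leaves a "_" placeholder. */
    static List<String> withFillers(List<String> tokens, Set<String> stopwords) {
        List<String> slots = new ArrayList<>();
        for (String t : tokens) {
            slots.add(stopwords.contains(t) ? FILLER : t);
        }
        return bigrams(slots);
    }

    /** Interpretation #1: never build a shingle that spans a stopword. */
    static List<String> withinRuns(List<String> tokens, Set<String> stopwords) {
        List<String> out = new ArrayList<>();
        List<String> run = new ArrayList<>();
        for (String t : tokens) {
            if (stopwords.contains(t)) {
                out.addAll(bigrams(run)); // close the run that ended at the stopword
                run.clear();
            } else {
                run.add(t);
            }
        }
        out.addAll(bigrams(run)); // flush the final run
        return out;
    }

    /** Interpretation #2: shingle as if the stopwords had never been in the text. */
    static List<String> asIfAbsent(List<String> tokens, Set<String> stopwords) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!stopwords.contains(t)) {
                kept.add(t);
            }
        }
        return bigrams(kept);
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("one", "two", "three", "four", "five");
        Set<String> stop = Set.of("three");
        System.out.println(withFillers(tokens, stop)); // [one two, two _, _ four, four five]
        System.out.println(withinRuns(tokens, stop));  // [one two, four five]
        System.out.println(asIfAbsent(tokens, stop));  // [one two, two four, four five]
    }
}
```

The sketch only models the single-stopword case quoted above; how the real
filters treat consecutive stopwords or larger shingle sizes is not covered
here.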


