lucene-java-user mailing list archives

From "Igal @ getRailo.org" <i...@getrailo.org>
Subject Re: Removing Empty Shingles in Lucene 4
Date Thu, 01 Nov 2012 21:19:10 GMT
hi Steve,

you are correct.  I am using StandardTokenizer.  I will look into the 
WhitespaceTokenizer and hopefully figure it out.

thank you,


Igal


On 11/1/2012 1:24 PM, Steve Rowe wrote:
> Hi Igal,
>
> You didn't say you were using StandardTokenizer, but assuming you are,
> right now StandardTokenizer throws away punctuation, so no following
> filters will see it.
>
> If StandardTokenizer were modified to also output currently non-tokenized
> punctuation as tokens, then you could use a FilteringTokenFilter that
> removes any shingle containing commas.  See [1] and [2] for previous
> discussions on this topic.
>
> For right now, if you use something like WhitespaceTokenizer, you could
> have a FilteringTokenFilter to remove shingles with non-final-token commas,
> and then another filter that strips commas everywhere.
>
> Steve
>
> [1] Mike McCandless's post on LUCENE-3940 <https://issues.apache.org/jira/browse/LUCENE-3940?focusedCommentId=13243299&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13243299>
>
> [2] dev@l.a.o thread "Improve OOTB behavior: English word-splitting should
> default to autoGeneratePhraseQueries=true" <http://markmail.org/message/ewza54azui6knqwf>
>
> On Nov 1, 2012, at 3:44 PM, Igal @ getRailo.org <igal@getrailo.org> wrote:
>
>> hi,
>>
>> I'm trying to migrate to Lucene 4.
>>
>> in Lucene 3.5 I extended org.apache.lucene.analysis.FilteringTokenFilter
>> and overrode accept() to remove undesired shingles.  in Lucene 4
>> org.apache.lucene.analysis.FilteringTokenFilter does not exist?
>>
>> I'm trying to achieve two things:
>>
>> 1) remove shingles that have an empty item.
>>
>> 2) remove shingles when the phrase contains a comma, for example:
>>
>>     for the phrase:    "delicious red apples, green pears, and oranges"
>>
>> I want the following shingles (with a shingle size of 2):
>>
>> "delicious red", "red apples", "green pears", "and oranges"
>> (no "apples green" because there's a comma)
>> (no "pears and" because there's a comma)
>>
>> any ideas?
>>
>> TIA
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>
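
The two-step approach Steve describes (drop shingles with a comma on any
non-final token, then strip commas everywhere) can be illustrated with a
minimal sketch in plain Java. This is not Lucene code: it only mimics
whitespace tokenization and shingling on a string, and the class name
ShingleSketch and the method shingles() are hypothetical names chosen for
this example. In a real analyzer chain the same test would presumably go in
the accept() method of a FilteringTokenFilter subclass.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; ShingleSketch is a hypothetical name,
// not a Lucene class, and no Lucene API is used here.
final class ShingleSketch {

    // Builds shingles (word n-grams) of the given size from
    // whitespace-separated text, dropping every shingle in which a
    // non-final token carries a comma, then stripping commas from the rest.
    static List<String> shingles(String text, int size) {
        List<String> result = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty() || size < 1) {
            return result;
        }
        // Whitespace splitting keeps punctuation attached to the token.
        String[] tokens = trimmed.split("\\s+");
        for (int start = 0; start + size <= tokens.length; start++) {
            boolean crossesComma = false;
            // Only tokens before the last one matter: a trailing comma on
            // the final token does not split the phrase.
            for (int i = start; i < start + size - 1; i++) {
                if (tokens[i].contains(",")) {
                    crossesComma = true;
                    break;
                }
            }
            if (crossesComma) {
                continue;
            }
            String shingle = String.join(" ",
                    Arrays.copyOfRange(tokens, start, start + size));
            // Second step: strip the remaining commas.
            result.add(shingle.replace(",", ""));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(
                shingles("delicious red apples, green pears, and oranges", 2));
    }
}
```

For the phrase in the original question and a shingle size of 2, this keeps
"delicious red", "red apples", "green pears" and "and oranges", and drops
"apples green" and "pears and".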



