lucene-java-user mailing list archives

From Steve Rowe <sar...@gmail.com>
Subject Re: Removing Empty Shingles in Lucene 4
Date Thu, 01 Nov 2012 20:24:54 GMT
Hi Igal,

You didn't say you were using StandardTokenizer, but assuming you are: right now StandardTokenizer
throws away punctuation, so no following filter will ever see the commas.

If StandardTokenizer were modified to also output currently non-tokenized punctuation as tokens,
then you could use a FilteringTokenFilter that removes any shingle containing commas. See
[1] and [2] for previous discussions on this topic.

For now, if you use something like WhitespaceTokenizer, commas stay attached to the preceding
word, so you could have a FilteringTokenFilter remove any shingle in which a token other than
the last one carries a comma, and then another filter that strips commas everywhere.
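
To make the order of those steps concrete, here is a minimal sketch in plain Java. It does not
use Lucene's TokenStream API at all; the class name ShingleSketch and the method
commaAwareShingles are made up for illustration. It only mimics the logic: split on whitespace,
build shingles, drop those where a non-final token carries a comma, then strip commas. Dropping
shingles that end up with an empty word is an extra assumption on my part.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; hypothetical names, no Lucene classes involved.
class ShingleSketch {

    /** Builds word shingles of the given size, skipping any that span a comma. */
    static List<String> commaAwareShingles(String text, int size) {
        List<String> shingles = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty() || size < 1) {
            return shingles;
        }
        // Step 1: split on whitespace only, so commas stay attached to tokens.
        String[] tokens = trimmed.split("\\s+");

        for (int start = 0; start + size <= tokens.length; start++) {
            // Step 2: reject the shingle if any token except the last has a comma.
            boolean skip = false;
            for (int i = start; i < start + size - 1; i++) {
                if (tokens[i].indexOf(',') >= 0) {
                    skip = true;
                    break;
                }
            }
            if (skip) {
                continue;
            }
            // Step 3: strip commas from the surviving shingle.
            StringBuilder sb = new StringBuilder();
            for (int i = start; i < start + size; i++) {
                String word = tokens[i].replace(",", "");
                if (word.isEmpty()) {
                    // Assumption: a shingle with an empty item is dropped as well.
                    skip = true;
                    break;
                }
                if (i > start) {
                    sb.append(' ');
                }
                sb.append(word);
            }
            if (!skip) {
                shingles.add(sb.toString());
            }
        }
        return shingles;
    }
}
```

Calling ShingleSketch.commaAwareShingles("delicious red apples, green pears, and oranges", 2)
returns "delicious red", "red apples", "green pears", "and oranges". In Lucene itself, step 2
would presumably live in the accept() method of a FilteringTokenFilter subclass placed after the
shingle filter, with step 3 as a separate filter after it.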

Steve

[1] Mike McCandless's post on LUCENE-3940 <https://issues.apache.org/jira/browse/LUCENE-3940?focusedCommentId=13243299&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13243299>

[2] dev@l.a.o thread "Improve OOTB behavior: English word-splitting should default to autoGeneratePhraseQueries=true"
<http://markmail.org/message/ewza54azui6knqwf>

On Nov 1, 2012, at 3:44 PM, Igal @ getRailo.org <igal@getrailo.org> wrote:

> hi,
> 
> I'm trying to migrate to Lucene 4.
> 
> in Lucene 3.5 I extended org.apache.lucene.analysis.FilteringTokenFilter and overrode
> accept() to remove undesired shingles.  in Lucene 4 org.apache.lucene.analysis.FilteringTokenFilter
> does not exist?
> 
> I'm trying to achieve two things:
> 
> 1) remove shingles that have an empty item.
> 
> 2) remove shingles when the phrase contains a comma, for example:
> 
>    for the phrase:    "delicious red apples, green pears, and oranges"
> 
> I want the following shingles (with a shingle size of 2):
> 
> "delicious red", "red apples", "green pears", "and oranges"
> (no "apples green" because there's a comma)
> (no "pears and" because there's a comma)
> 
> any ideas?
> 
> TIA
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 

