lucene-java-user mailing list archives

From Michael Sokolov <msoko...@gmail.com>
Subject Re: WordDelimiterGraphFilter swallows emojis
Date Tue, 03 Jul 2018 14:46:50 GMT
Yes, that sounds good -- ConditionalTokenFilter is going to be very
helpful. We have overridden the ICUTokenizer's RBBI rules, but I'll poke
around and see about incorporating the emoji rules from the default
rule set. Thanks, Robert.

On Tue, Jul 3, 2018 at 9:28 AM Robert Muir <rcmuir@gmail.com> wrote:

> > Any thoughts?
>
> The best idea I have would be to tokenize with ICUTokenizer, which
> tags emoji sequences with the token type "<EMOJI>", then use
> ConditionalTokenFilter to send all tokens EXCEPT those with a token
> type of "<EMOJI>" to your WordDelimiterFilter. This way
> WordDelimiterFilter never sees the emoji at all and can't screw them
> up.
>
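
The routing Robert describes can be sketched roughly as follows. This is a
simplified, stdlib-only model of the idea and not Lucene code: Token,
splitOnDelimiters, filterConditionally and routeAroundEmoji are all
hypothetical names invented for illustration, and the toy splitter only
breaks on '-' and '_'. In a real analysis chain the token type would come
from the tokenizer's TypeAttribute, and the "should this token be
filtered?" decision would live in a ConditionalTokenFilter subclass's
shouldFilter() method wrapped around the word-delimiter filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Simplified model of conditional filtering; none of these types are Lucene classes.
class EmojiRoutingSketch {

    // Hypothetical stand-in for a token's term text plus its type attribute.
    static final class Token {
        final String text;
        final String type;

        Token(String text, String type) {
            this.text = text;
            this.type = type;
        }

        @Override
        public String toString() {
            return text + "/" + type;
        }
    }

    // Toy stand-in for a word-delimiter filter: splits on '-' and '_' only.
    static List<Token> splitOnDelimiters(Token token) {
        List<Token> parts = new ArrayList<>();
        for (String part : token.text.split("[-_]+")) {
            if (!part.isEmpty()) {
                parts.add(new Token(part, token.type));
            }
        }
        return parts;
    }

    // Applies 'filter' only to tokens accepted by 'shouldFilter';
    // every other token passes through untouched.
    static List<Token> filterConditionally(List<Token> input,
                                           Predicate<Token> shouldFilter,
                                           Function<Token, List<Token>> filter) {
        List<Token> output = new ArrayList<>();
        for (Token token : input) {
            if (shouldFilter.test(token)) {
                output.addAll(filter.apply(token));
            } else {
                output.add(token);
            }
        }
        return output;
    }

    // Sends everything EXCEPT "<EMOJI>"-typed tokens to the splitter.
    static List<Token> routeAroundEmoji(List<Token> input) {
        return filterConditionally(
                input,
                token -> !"<EMOJI>".equals(token.type),
                EmojiRoutingSketch::splitOnDelimiters);
    }

    public static void main(String[] args) {
        // Token types here are assigned by hand; "\uD83D\uDC4D" is U+1F44D.
        List<Token> tokens = Arrays.asList(
                new Token("wi-fi", "<ALPHANUM>"),
                new Token("\uD83D\uDC4D", "<EMOJI>"),
                new Token("power_shot", "<ALPHANUM>"));
        System.out.println(routeAroundEmoji(tokens));
    }
}
```

In this model the emoji token comes out unchanged while "wi-fi" and
"power_shot" are split into their parts, which is the behaviour the
suggestion above is after. With custom RBBI rules, the condition only
helps if the tokenizer still assigns the "<EMOJI>" type in the first
place.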
