lucene-java-user mailing list archives

From Michael Sokolov <msoko...@gmail.com>
Subject Re: WordDelimiterGraphFilter swallows emojis
Date Tue, 03 Jul 2018 21:04:57 GMT
Ah, I see -- there is \p{Emoji} to start with, which is nice, but there is
also this Extended_Pictographic property. I'll read more and get back if I
have questions. It might be a little while before I dig in to this, though.
Thanks again.

On Tue, Jul 3, 2018 at 11:25 AM Robert Muir <rcmuir@gmail.com> wrote:

> If you customized the rules, maybe have a look at
> https://issues.apache.org/jira/browse/LUCENE-8366
>
> The rules got simpler and we also updated the customization example
> used for the factory's test.
>
> On Tue, Jul 3, 2018 at 10:46 AM, Michael Sokolov <msokolov@gmail.com>
> wrote:
> > Yes, that sounds good -- this ConditionalTokenFilter is going to be very
> > helpful. We have overridden the ICUTokenizer's RBBI rules, but I'll poke
> > around and see about incorporating the emoji rules from there. Thanks,
> > Robert
> >
> > On Tue, Jul 3, 2018 at 9:28 AM Robert Muir <rcmuir@gmail.com> wrote:
> >
> >> > Any thoughts?
> >>
> >> The best idea I have would be to tokenize with ICUTokenizer, which will
> >> tag emoji sequences with the "<EMOJI>" token type, then use
> >> ConditionalTokenFilter to send all tokens EXCEPT those with a token type
> >> of "<EMOJI>" to your WordDelimiterFilter. This way
> >> WordDelimiterFilter never sees the emoji at all and can't screw them
> >> up.
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
>
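
For readers following the thread, here is a minimal, library-free sketch of the routing idea Robert describes: tokens typed "<EMOJI>" bypass the word-delimiter step, and everything else goes through it. The Token class, splitOnDelimiters, and route are hypothetical stand-ins written for illustration, not Lucene APIs, and the "<ALPHANUM>" type on the sample word is likewise an assumption. In a real analysis chain the tokenizer would be ICUTokenizer and the bypass would be expressed with ConditionalTokenFilter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class EmojiRouting {

  /** Hypothetical token: its text plus a type tag such as "<EMOJI>". */
  public static final class Token {
    public final String text;
    public final String type;

    public Token(String text, String type) {
      this.text = text;
      this.type = type;
    }

    @Override
    public String toString() {
      return text + type;
    }
  }

  /**
   * Hypothetical stand-in for a word-delimiter step: splits on runs of
   * non-alphanumeric characters and keeps only the alphanumeric parts.
   * An emoji-only token therefore produces no output at all.
   */
  public static List<Token> splitOnDelimiters(Token token) {
    List<Token> parts = new ArrayList<>();
    for (String part : token.text.split("[^\\p{Alnum}]+")) {
      if (!part.isEmpty()) {
        parts.add(new Token(part, token.type));
      }
    }
    return parts;
  }

  /**
   * Conditional routing: emoji tokens pass through untouched, all other
   * tokens are sent to the delimiter step.
   */
  public static List<Token> route(List<Token> tokens) {
    List<Token> out = new ArrayList<>();
    for (Token token : tokens) {
      if ("<EMOJI>".equals(token.type)) {
        out.add(token); // bypass: the delimiter step never sees it
      } else {
        out.addAll(splitOnDelimiters(token));
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<Token> input = Arrays.asList(
        new Token("wi-fi", "<ALPHANUM>"),
        new Token("\uD83D\uDE00", "<EMOJI>")); // U+1F600, written as an escape
    System.out.println(route(input));
  }
}
```

Without the type check in route, the emoji token would go through splitOnDelimiters and vanish, which mirrors the behaviour named in the subject line.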
