lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: WordDelimiterGraphFilter swallows emojis
Date Tue, 03 Jul 2018 15:25:23 GMT
If you customized the rules, maybe have a look at
https://issues.apache.org/jira/browse/LUCENE-8366

The rules got simpler and we also updated the customization example
used for the factory's test.

On Tue, Jul 3, 2018 at 10:46 AM, Michael Sokolov <msokolov@gmail.com> wrote:
> Yes, that sounds good -- this ConditionalTokenFilter is going to be very
> helpful. We have overridden the ICUTokenizer's RBBI rules, but I'll poke
> around and see about incorporating the emoji rules from there. Thanks,
> Robert
>
> On Tue, Jul 3, 2018 at 9:28 AM Robert Muir <rcmuir@gmail.com> wrote:
>
>> > Any thoughts?
>>
>> The best idea I have would be to tokenize with ICUTokenizer, which will
>> tag emoji sequences with the "<EMOJI>" token type, then use
>> ConditionalTokenFilter to send all tokens EXCEPT those with a token type
>> of "<EMOJI>" to your WordDelimiterFilter. This way
>> WordDelimiterFilter never sees the emoji at all and can't screw them
>> up.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
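The routing idea quoted above can be sketched in plain Java. This is only a
model of the logic, not Lucene code: the Token class, splitOnDelimiters and
filterUnless are hypothetical names made up for illustration, and the toy
splitter merely imitates a word-delimiter filter that drops non-alphanumeric
text. In Lucene itself, the equivalent would presumably be a
ConditionalTokenFilter subclass whose shouldFilter() checks the token's
TypeAttribute.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class EmojiBypassSketch {
    /** Hypothetical stand-in for a token: its text plus a type label. */
    static final class Token {
        final String text;
        final String type;
        Token(String text, String type) { this.text = text; this.type = type; }
    }

    /** Toy word-delimiter: splits on runs of non-letter, non-digit characters,
     *  so a token consisting only of emoji produces no output at all. */
    static List<Token> splitOnDelimiters(Token t) {
        List<Token> out = new ArrayList<>();
        for (String part : t.text.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) out.add(new Token(part, t.type));
        }
        return out;
    }

    /** Applies the filter to every token EXCEPT those whose type equals bypassType. */
    static List<Token> filterUnless(List<Token> tokens, String bypassType,
                                    Function<Token, List<Token>> filter) {
        List<Token> out = new ArrayList<>();
        for (Token t : tokens) {
            if (bypassType.equals(t.type)) {
                out.add(t);                   // passed through untouched
            } else {
                out.addAll(filter.apply(t));  // routed through the filter
            }
        }
        return out;
    }

    /** Convenience helper: extracts the token texts for printing or comparison. */
    static List<String> texts(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        for (Token t : tokens) out.add(t.text);
        return out;
    }

    public static void main(String[] args) {
        List<Token> in = new ArrayList<>();
        in.add(new Token("wi-fi", "<ALPHANUM>"));
        in.add(new Token("\uD83D\uDE00", "<EMOJI>")); // U+1F600, written as a surrogate pair
        System.out.println(texts(filterUnless(in, "<EMOJI>", EmojiBypassSketch::splitOnDelimiters)));
    }
}
```

With the bypass in place the emoji token survives while "wi-fi" is still
split; sending every token through the toy splitter instead makes the emoji
disappear, which is the behaviour the subject line complains about.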