lucene-java-user mailing list archives

From julien Blaize <julien.bla...@gmail.com>
Subject Re: WordDelimiterGraphFilter swallows emojis
Date Tue, 03 Jul 2018 13:03:55 GMT
Hello Michael,

I have previously worked on emoji detection with Lucene.

I had to extend the Tokenizer class (and not TokenFilter, as
WordDelimiterFilter does) to preserve the delimiter attribute.
I also had to keep track of consecutive delimiters in the character stream,
because Lucene's default implementation only keeps the last one.

It may put you on the right track to start by looking at the
Tokenizer instead of the TokenFilter.

By the way, I used the emoji list from this project to detect sequences of
characters:
https://github.com/jolicode/emoji-search/blob/master/synonyms/cldr-emoji-annotation-synonyms-fr.txt
I scan the character stream, and while the current sequence is still a
possible emoji I keep tracking. Once I have a full emoji, I put it in the
CharTermAttribute so it is treated as a word and not as a delimiter.
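
To make the matching step concrete, here is a rough sketch of the idea in
plain Java, not my actual Tokenizer code. It assumes the emoji list has
already been loaded as a collection of strings, and the class and method
names (EmojiSequenceSplitter, longestMatchAt, split) are made up for this
example. It keeps extending a candidate while it is still a prefix of some
known emoji and remembers the longest full match. In a real Tokenizer the
matched piece would be what goes into the CharTermAttribute.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: greedy longest-match detection of known emoji sequences. */
final class EmojiSequenceSplitter {
    private final Set<String> emoji = new HashSet<>();    // complete emoji sequences
    private final Set<String> prefixes = new HashSet<>(); // every code-point prefix of a known emoji

    EmojiSequenceSplitter(Iterable<String> knownEmoji) {
        for (String e : knownEmoji) {
            emoji.add(e);
            int i = 0;
            while (i < e.length()) {
                i += Character.charCount(e.codePointAt(i));
                prefixes.add(e.substring(0, i));
            }
        }
    }

    /** Length in chars of the longest known emoji starting at pos, or 0 if none. */
    int longestMatchAt(String text, int pos) {
        int best = 0;
        int end = pos;
        while (end < text.length()) {
            end += Character.charCount(text.codePointAt(end));
            String candidate = text.substring(pos, end);
            if (!prefixes.contains(candidate)) {
                break; // this sequence can no longer become an emoji
            }
            if (emoji.contains(candidate)) {
                best = end - pos; // full emoji so far; keep tracking for a longer one
            }
        }
        return best;
    }

    /** Splits text so that each known emoji is its own piece; other text stays in runs. */
    List<String> split(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        int pos = 0;
        while (pos < text.length()) {
            int len = longestMatchAt(text, pos);
            if (len > 0) {
                if (run.length() > 0) {
                    out.add(run.toString());
                    run.setLength(0);
                }
                out.add(text.substring(pos, pos + len));
                pos += len;
            } else {
                int cp = text.codePointAt(pos);
                run.appendCodePoint(cp);
                pos += Character.charCount(cp);
            }
        }
        if (run.length() > 0) {
            out.add(run.toString());
        }
        return out;
    }
}
```

The longest-match rule matters for multi-code-point emoji such as ZWJ
sequences: a shorter emoji that is a prefix of a longer one is only emitted
when the longer one does not complete.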

Regards
--
Julien Blaize


On Tue, Jul 3, 2018 at 14:00, Michael Sokolov <msokolov@gmail.com> wrote:

> WDGF (and WordDelimiterFilter) treat emoji as "SUBWORD_DELIM" characters
> like punctuation and thus remove them, but we would like to be able to
> search for emoji and use this filter for handling dashes, dots and other
> intra-word punctuation.
>
> These filters identify non-word and non-digit characters by two mechanisms:
> direct lookup in a character table, and fallback to Unicode class. The
> character table can't easily be used to handle emoji since it would need to
> be populated with the entire Unicode character set in order to reach
> emoji-land. On the other hand, if we change the handling of emoji by class,
> and say treat them as word-characters, this will also end up pulling in all
> the other OTHER_SYMBOL characters as well. Maybe that's OK, but I think
> some of these other symbols are more like punctuation (this class is a grab
> bag of all kinds of beautiful dingbats like trademark, degrees-symbols, etc
> https://www.compart.com/en/unicode/category/So). On the other other hand,
> how do we even identify emoji? I don't think the Java Character API is
> adequate to the task. Perhaps we must incorporate a table.
>
> Suppose we come up with a good way to classify emoji; then how should they
> be treated in this class? Sometimes they may be embedded in tokens with
> other characters: I see people using emoji and other symbols as part of
> their names, and sometimes they stand alone (with whitespace separation). I
> think one way forward here would be to treat these as a special class akin
> to words and numbers, and provide similar options (SPLIT_ON_EMOJI,
> CATENATE_EMOJI) as we have for those classes.
>
> Or maybe as a convenience, we provide a way to get a table that encodes the
> default classifications of all characters up to some given limit, and then
> let the caller modify it? That would at least provide an easy way to treat
> emoji as letters.
>
> Any thoughts?
>
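
On the point above about handling emoji by Unicode class: a small check
with the standard java.lang.Character API illustrates why the class alone
is too coarse. The helper name below (SymbolClassCheck.isOtherSymbol) is
made up for this example. An emoji such as U+1F600, the trademark sign
U+2122 and the degree sign U+00B0 all report the same general category,
OTHER_SYMBOL, so treating that category as word characters would pull in
all three.

```java
/** Hypothetical helper showing that OTHER_SYMBOL does not separate emoji from other symbols. */
final class SymbolClassCheck {
    /** True if the code point's Unicode general category is So (OTHER_SYMBOL). */
    static boolean isOtherSymbol(int codePoint) {
        return Character.getType(codePoint) == Character.OTHER_SYMBOL;
    }

    public static void main(String[] args) {
        int[] samples = {0x1F600, 0x2122, 0x00B0, 'a', '-'};
        for (int cp : samples) {
            System.out.printf("U+%04X otherSymbol=%b%n", cp, isOtherSymbol(cp));
        }
    }
}
```

This is why a dedicated table of emoji sequences looks necessary if the
goal is to keep emoji without also keeping every other symbol.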
