lucene-java-user mailing list archives

From Michael Sokolov <msoko...@gmail.com>
Subject Re: WordDelimiterGraphFilter swallows emojis
Date Tue, 03 Jul 2018 14:45:30 GMT
Thanks for the pointer

On Tue, Jul 3, 2018 at 9:04 AM julien Blaize <julien.blaize@gmail.com>
wrote:

> Hello Michael,
>
> I have previously worked on emoji detection with Lucene.
>
> I had to extend the Tokenizer class (rather than TokenFilter, as
> WordDelimiterFilter does) to preserve the delimiter attribute.
> I also had to keep track of consecutive delimiters in the character stream,
> because Lucene's default implementation only keeps the last one.
>
> Looking at the Tokenizer instead of the TokenFilter may put you on the
> right track.
>
> By the way, I used the emoji list from this project to detect sequences of
> characters:
>
> https://github.com/jolicode/emoji-search/blob/master/synonyms/cldr-emoji-annotation-synonyms-fr.txt
>
> I scan the character stream, and as long as the current sequence is a
> possible emoji I keep tracking; once I have a full emoji I put it in the
> CharTermAttribute so it's treated as a word and not a delimiter.
>
> Regards
> --
> Julien Blaize
>
>
> On Tue, Jul 3, 2018 at 14:00, Michael Sokolov <msokolov@gmail.com>
> wrote:
>
> > WDGF (and WordDelimiterFilter) treat emoji as "SUBWORD_DELIM" characters
> > like punctuation and thus remove them, but we would like to be able to
> > search for emoji and use this filter for handling dashes, dots and other
> > intra-word punctuation.
> >
> > These filters identify non-word and non-digit characters by two
> > mechanisms: direct lookup in a character table, and fallback to Unicode
> > class. The character table can't easily be used to handle emoji since it
> > would need to be populated with the entire Unicode character set in order
> > to reach emoji-land. On the other hand, if we change the handling of emoji
> > by class, and say treat them as word-characters, this will also end up
> > pulling in all the other OTHER_SYMBOL characters as well. Maybe that's OK,
> > but I think some of these other symbols are more like punctuation (this
> > class is a grab bag of all kinds of beautiful dingbats like trademark,
> > degrees-symbols, etc
> > https://www.compart.com/en/unicode/category/So). On the other other hand,
> > how do we even identify emoji? I don't think the Java Character API is
> > adequate to the task. Perhaps we must incorporate a table.
> >
> > Suppose we come up with a good way to classify emoji; then how should they
> > be treated in this class? Sometimes they may be embedded in tokens with
> > other characters: I see people using emoji and other symbols as part of
> > their names, and sometimes they stand alone (with whitespace separation).
> > I think one way forward here would be to treat these as a special class
> > akin to words and numbers, and provide similar options (SPLIT_ON_EMOJI,
> > CATENATE_EMOJI) as we have for those classes.
> >
> > Or maybe as a convenience, we provide a way to get a table that encodes
> > the default classifications of all characters up to some given limit, and
> > then let the caller modify it? That would at least provide an easy way to
> > treat emoji as letters.
> >
> > Any thoughts?
> >
>
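
The OTHER_SYMBOL concern in the quoted message can be checked with the JDK
alone. What follows is a minimal sketch, not code from the thread: the class
name EmojiCategoryCheck and the sample code points are chosen here for
illustration. It only shows that java.lang.Character reports the same general
category for emoji as for symbols such as the trademark and degree signs, so
the category by itself cannot tell them apart.

```java
public class EmojiCategoryCheck {

    // True when the JDK reports the code point's general category as
    // "So" (Character.OTHER_SYMBOL).
    static boolean isOtherSymbol(int codePoint) {
        return Character.getType(codePoint) == Character.OTHER_SYMBOL;
    }

    public static void main(String[] args) {
        // Illustrative samples: two emoji followed by three non-emoji symbols.
        int[] samples = {
            0x1F600, // grinning face
            0x2764,  // heavy black heart
            0x2122,  // trade mark sign
            0x00B0,  // degree sign
            0x00A9   // copyright sign
        };
        for (int cp : samples) {
            System.out.printf("U+%04X otherSymbol=%b%n", cp, isOtherSymbol(cp));
        }
    }
}
```

Every sample above falls in the same category, which is why a change keyed on
OTHER_SYMBOL would affect the dingbats as well as the emoji.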

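The table idea at the end of the quoted message could look roughly like the
sketch below. It is standalone and does not call Lucene: the class
DelimiterTypeTable, its type codes, and the helper names are hypothetical, and
the category-to-type mapping is a simplified assumption rather than the
filter's actual one. The emoji range used (U+1F600 to U+1F64F, the Emoticons
block) is only one example of what a caller might override.

```java
public class DelimiterTypeTable {

    // Hypothetical type codes for this sketch only.
    static final byte LOWER = 0x01;
    static final byte UPPER = 0x02;
    static final byte ALPHA = 0x03;
    static final byte DIGIT = 0x04;
    static final byte SUBWORD_DELIM = 0x08;

    // Assumed default classification: letters and digits by Unicode
    // general category, everything else a sub-word delimiter.
    static byte defaultType(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.UPPERCASE_LETTER:
                return UPPER;
            case Character.LOWERCASE_LETTER:
                return LOWER;
            case Character.TITLECASE_LETTER:
            case Character.MODIFIER_LETTER:
            case Character.OTHER_LETTER:
                return ALPHA;
            case Character.DECIMAL_DIGIT_NUMBER:
            case Character.LETTER_NUMBER:
            case Character.OTHER_NUMBER:
                return DIGIT;
            default:
                return SUBWORD_DELIM;
        }
    }

    // Builds a table holding the default type of every code point below limit.
    static byte[] buildDefaultTable(int limit) {
        byte[] table = new byte[limit];
        for (int cp = 0; cp < limit; cp++) {
            table[cp] = defaultType(cp);
        }
        return table;
    }

    // Lets the caller reclassify an inclusive range of code points.
    static void setRange(byte[] table, int first, int last, byte type) {
        for (int cp = first; cp <= last && cp < table.length; cp++) {
            table[cp] = type;
        }
    }

    public static void main(String[] args) {
        byte[] table = buildDefaultTable(0x1F650);
        setRange(table, 0x1F600, 0x1F64F, ALPHA); // treat emoticons as letters
        System.out.println("emoji is ALPHA: " + (table[0x1F600] == ALPHA));
        System.out.println("TM is delimiter: " + (table[0x2122] == SUBWORD_DELIM));
    }
}
```

A table covering the Emoticons block needs about 128 KB at one byte per code
point, which illustrates the size concern raised in the quoted message.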