lucene-dev mailing list archives

From "Tomoko Uchida (JIRA)" <>
Subject [jira] [Commented] (LUCENE-8863) Improve handling of edge cases in Kuromoji's DIctionaryBuilder
Date Sun, 16 Jun 2019 00:09:00 GMT


Tomoko Uchida commented on LUCENE-8863:

As far as UniDic and trustworthy extensions (neologd or Sudachi) are concerned, they should
not have an empty base form, POS tag, or any other empty field (I will check it, please trust
me). An empty base form is actually a bug, and I will have to report it to the dictionary
developers if those dictionaries have invalid or erroneous entries.

> Improve handling of edge cases in Kuromoji's DIctionaryBuilder
> --------------------------------------------------------------
>                 Key: LUCENE-8863
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>            Assignee: Mike Sokolov
>            Priority: Major
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
> While building a custom Kuromoji system dictionary, I discovered a few issues.
> First, the dictionary encoding has room for 13-bit (left and right) ids, but really only
> supports 12 bits, since this was all that was needed for the IPADIC dictionary that ships with
> Kuromoji. The good news is we can easily add support by fixing the bit-twiddling math.
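
To illustrate the kind of bit-twiddling problem described above, here is a minimal Java
sketch. It assumes a hypothetical layout in which a 13-bit id shares a 16-bit short with
3 flag bits; that layout and the names IdPacking, pack, unpack and unpackNaive are made up
for illustration and are not quoted from Kuromoji's source. Under this assumption, ids of
4096 and above set the sign bit of the short, so a decoder that shifts without masking
only works for 12-bit ids.

```java
final class IdPacking {
    // Assumed layout (hypothetical): upper 13 bits hold the id, lower 3 bits hold flags.
    static final int FLAG_BITS = 3;
    static final int MAX_ID = (1 << 13) - 1; // 8191

    static short pack(int id, int flags) {
        if (id < 0 || id > MAX_ID) {
            throw new IllegalArgumentException("id out of range: " + id);
        }
        if (flags < 0 || flags >= (1 << FLAG_BITS)) {
            throw new IllegalArgumentException("flags out of range: " + flags);
        }
        return (short) (id << FLAG_BITS | flags);
    }

    // Naive decode: the short is sign-extended to int before the shift,
    // so any id >= 4096 (sign bit set) comes back wrong.
    static int unpackNaive(short packed) {
        return packed >>> FLAG_BITS;
    }

    // Masking to 16 bits first discards the sign extension and recovers all 13 bits.
    static int unpack(short packed) {
        return (packed & 0xffff) >>> FLAG_BITS;
    }

    public static void main(String[] args) {
        short packed = pack(5000, 1);
        System.out.println(unpack(packed));                // 5000
        System.out.println(unpackNaive(packed) == 5000);   // false
    }
}
```

The only difference between the two decoders is the `& 0xffff` mask, which is the sort of
small fix the description has in mind.
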
> Second, the dictionary builder has a number of assertions that help uncover problems
> in the input (like these overlarge ids), but the assertions aren't enabled by default, so
> an unsuspecting new user doesn't get any benefit from them. We should upgrade them to "real"
> exceptions.
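
The difference between the two styles of check can be sketched as follows. This is an
illustrative example only: the class IdChecks, its method names and the 13-bit limit are
assumptions for the sketch, not the builder's actual code. Java `assert` statements are
skipped unless the JVM is started with `-ea`, while an explicit exception is always thrown.

```java
final class IdChecks {
    // Hypothetical limit, taken from the 13-bit id width mentioned in the issue.
    static final int MAX_ID = (1 << 13) - 1;

    // Only effective when assertions are enabled (java -ea ...).
    // With default JVM settings an overlarge id passes through silently.
    static int checkWithAssert(int id) {
        assert id >= 0 && id <= MAX_ID : "id out of range: " + id;
        return id;
    }

    // Always effective, regardless of JVM flags.
    static int checkWithException(int id) {
        if (id < 0 || id > MAX_ID) {
            throw new IllegalArgumentException(
                "id out of range [0, " + MAX_ID + "]: " + id);
        }
        return id;
    }

    public static void main(String[] args) {
        System.out.println(checkWithException(100));
        try {
            checkWithException(MAX_ID + 1);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```
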
> Finally, we want to handle the case of empty base forms differently. Kuromoji does stemming
> by substituting a base form for a word when there is a base form in the dictionary. Missing
> base forms are expected to be supplied as {{*}}, but if a dictionary provides an empty string
> base form, we would end up stripping that token completely. Since there is no possible meaning
> for an empty base form (and the dictionary builder already treats {{*}} and empty strings
> as equivalent in a number of other cases), I think we should simply ignore empty base forms
> (rather than replacing words with empty strings when tokenizing!).
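
The proposed handling of empty base forms could look roughly like the sketch below. The
class BaseForms and its methods are hypothetical names for illustration, and the ASCII
sample words stand in for real dictionary entries; the only detail taken from the
description is that `*` and the empty string should both mean "no base form".

```java
final class BaseForms {
    // Placeholder the dictionary is expected to use for a missing base form.
    static final String MISSING = "*";

    // Returns null when the entry carries no usable base form.
    static String normalizeBaseForm(String raw) {
        if (raw == null || raw.isEmpty() || MISSING.equals(raw)) {
            return null;
        }
        return raw;
    }

    // Substitutes the base form when one exists, otherwise keeps the surface form,
    // so an empty base form can never turn a token into an empty string.
    static String stem(String surface, String rawBaseForm) {
        String base = normalizeBaseForm(rawBaseForm);
        return base == null ? surface : base;
    }

    public static void main(String[] args) {
        System.out.println(stem("running", "run")); // run
        System.out.println(stem("tokyo", "*"));     // tokyo
        System.out.println(stem("tokyo", ""));      // tokyo
    }
}
```
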
