lucene-dev mailing list archives

From "Tomoko Uchida (JIRA)" <>
Subject [jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary
Date Wed, 12 Jun 2019 01:55:01 GMT


Tomoko Uchida commented on LUCENE-8816:

I'd like to add some more information: the leftID (and rightID) is tied to the POS tags, and
in practice there are not many POS tag variations. I think the current constraint {{leftId
< 4096}} (or {{leftId < 8191}}, if it can easily be relaxed to that) is perfectly okay if
the following conditions are met.

1. The dictionary learner/re-trainer program included in the mecab-ipadic devtool does not
generate leftID (and rightID) values larger than 4096 (or 8191).
2. UniDic (I'd like to support this dictionary in this issue, as I wrote in the issue description)
has no leftID (and rightID) values greater than 4096 (or 8191).
3. A few well-known variants of mecab-ipadic or UniDic do not have leftID (and rightID)
values larger than 4096 (or 8191).
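
Conditions 2 and 3 could be checked mechanically by scanning a dictionary's CSV source for its largest context ID. The following is only a minimal sketch: it assumes the usual MeCab CSV layout (surface, leftID, rightID, cost, then features), the class and method names are hypothetical, and the sample rows are made up for illustration rather than taken from any real dictionary.

```java
import java.util.List;

public class ContextIdScan {
    /**
     * Returns the largest leftID or rightID found in MeCab-style CSV lines,
     * or -1 if there are no entries.
     */
    public static int maxContextId(List<String> csvLines) {
        int max = -1;
        for (String line : csvLines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            // Assumed column layout: surface,leftID,rightID,cost,features...
            // Naive split: does not handle quoted surface forms that contain commas.
            String[] cols = line.split(",", 5);
            int leftId = Integer.parseInt(cols[1].trim());
            int rightId = Integer.parseInt(cols[2].trim());
            max = Math.max(max, Math.max(leftId, rightId));
        }
        return max;
    }

    public static void main(String[] args) {
        // Made-up sample rows, for illustration only.
        List<String> sample = List.of(
            "foo,1285,1285,3003,noun",
            "bar,151,151,4304,particle");
        int max = maxContextId(sample);
        System.out.println("max context ID = " + max);
        System.out.println("satisfies leftId < 4096: " + (max < 4096));
    }
}
```

For a real dictionary, the lines would be read from every CSV file in the distribution (for example with `java.nio.file.Files.readAllLines`, using the charset the dictionary was published in).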

Give me some time to examine whether we need to reconsider the constraint. (It's just a guess,
but the original MeCab itself is also performance-conscious software, so it could have similar
restrictions on its dictionary format.) At least for point 3, I think I can talk with
the dictionary developers about it before tackling the Lucene code, if needed.

Another possibility is that users assign large values to leftIDs (and rightIDs) in their
customized dictionaries by hand; however, I don't think we need to handle that case. I have
no idea about Korean dictionaries.

I agree that it would be better to change all the assertions to exceptions so that
users can figure out the problem with their customized dictionary.
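
As a rough illustration of that change, a range check that throws instead of asserting might look like the sketch below. The class, method, and constant names are hypothetical and are not taken from the Lucene code; the limit is the {{leftId < 4096}} constraint discussed above. Unlike an assertion, which is skipped unless the JVM runs with {{-ea}}, the exception is always raised and can name the offending entry.

```java
public class ContextIdCheck {
    // Exclusive upper bound, from the constraint discussed above (leftId < 4096).
    public static final int MAX_CONTEXT_ID_EXCLUSIVE = 4096;

    /**
     * Validates a context ID and returns it unchanged.
     * Throws IllegalArgumentException (rather than tripping an assert)
     * so that the failure is reported even when assertions are disabled.
     */
    public static int checkContextId(String name, int id, String entry) {
        if (id < 0 || id >= MAX_CONTEXT_ID_EXCLUSIVE) {
            throw new IllegalArgumentException(
                name + " " + id + " is out of range [0, " + MAX_CONTEXT_ID_EXCLUSIVE
                    + ") in dictionary entry: " + entry);
        }
        return id;
    }

    public static void main(String[] args) {
        // Made-up entry, for illustration only.
        String entry = "foo,5000,12,100,noun";
        try {
            checkContextId("leftID", 5000, entry);
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```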

> Decouple Kuromoji's morphological analyser and its dictionary
> -------------------------------------------------------------
>                 Key: LUCENE-8816
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Tomoko Uchida
>            Priority: Major
> I was inspired by this mailing-list thread.
> As many Japanese already know, the default built-in dictionary bundled with Kuromoji (MeCab
IPADIC) is a bit old and has not been maintained for many years. While it has slowly become obsolete,
well-maintained and/or extended dictionaries have emerged in recent years (e.g. mecab-ipadic-neologd,
UniDic). Several attempts/projects/efforts to use them with Kuromoji
have been made in Japan.
> However, the current architecture - a dictionary-bundled jar - is essentially incompatible with
the idea of switching the system dictionary, and developers have difficulty doing so.
> Traditionally, the morphological analysis engine (Viterbi logic) and the encoded dictionary
(language model) have been decoupled (as in MeCab, the origin of Kuromoji, or lucene-gosen).
So decoupling them is a natural idea, and I feel that it's a good time to rethink
the current architecture.
> Also this would be good for advanced users who have customized/re-trained their own system
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and encoded system dictionary.
>  * Implement dynamic dictionary load mechanism.
>  * Provide developer-oriented dictionary build tool.
> Non-goals:
>   * Provide learner or language model (it's up to users and should be outside the scope).
> I have not dived into the code yet, so I have no idea whether it's easy or difficult at this
