lucene-dev mailing list archives

From "Tomoko Uchida (JIRA)" <>
Subject [jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary
Date Sat, 01 Jun 2019 14:28:00 GMT


Tomoko Uchida commented on LUCENE-8816:

OK, if you are able to merge the two dictionary builders without affecting kuromoji,
just open an issue (don't forget to add a link to this one) and work on it. Once your branch/patch
has been approved and pushed to the upstream Lucene repo (ASF gitbox), I will merge it
from upstream and continue my work.

Please keep in mind: I won't merge any branch/patch into my local branch that has not yet been merged
to upstream master (in other words, I will merge or cherry-pick only from upstream master).
And vice versa: never merge my WIP branch/patch into your branch/patch.

I cannot take time to review your branch/patch (in any case, I am still new to kuromoji and nori);
I hope someone with commit privileges will take care of [~danmuzi]'s patch.

> Decouple Kuromoji's morphological analyser and its dictionary
> -------------------------------------------------------------
>                 Key: LUCENE-8816
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Tomoko Uchida
>            Priority: Major
> I was inspired by this mail-list thread.
>  []
> As many Japanese already know, the default built-in dictionary bundled with Kuromoji (MeCab
> IPADIC) is a bit old and has not been maintained for many years. While it has slowly become obsolete,
> well-maintained and/or extended dictionaries have emerged in recent years (e.g. mecab-ipadic-neologd,
> UniDic). Several attempts/projects/efforts to use them with Kuromoji
> have been made in Japan.
> However, the current architecture (a jar with the dictionary bundled) is essentially incompatible with
> the idea of switching the system dictionary, and developers have difficulty doing so.
> Traditionally, the morphological analysis engine (Viterbi logic) and the encoded dictionary
> (language model) have been decoupled (as in MeCab, the origin of Kuromoji, or lucene-gosen).
> So decoupling them is a natural idea, and I feel that it's a good time to rethink
> the current architecture.
> Also this would be good for advanced users who have customized/re-trained their own system
> Goals of this issue:
>  * Decouple JapaneseTokenizer from the encoded system dictionary.
>  * Implement a dynamic dictionary loading mechanism.
>  * Provide a developer-oriented dictionary build tool.
> Non-goals:
>  * Provide a learner or language model (that is up to users and should be out of scope).
> I have not dived into the code yet, so I have no idea whether it's easy or difficult at this

This message was sent by Atlassian JIRA
