lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary
Date Tue, 11 Jun 2019 23:52:00 GMT


Robert Muir commented on LUCENE-8816:

Mike: yes, I agree with you. The use of assert was laziness on my part: we should treat it
as technical debt and fix it. See my earlier comments on this JIRA issue for elaboration.
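
To illustrate why assert-based checks count as debt: Java assertions are skipped unless the JVM is started with -ea, so a check written as an assert is not enforced in a typical production run. This message does not say which asserts are meant, so the following is only a minimal sketch with hypothetical class and method names (nothing here is taken from the Lucene code base), contrasting an assert with an explicit check:

```java
import java.util.Objects;

/** Hypothetical illustration only; not actual Lucene code. */
public class DictionaryChecks {

  /** Assert-based: the range check is skipped unless the JVM runs with -ea. */
  public static int lookupWithAssert(int[] costs, int id) {
    assert id >= 0 && id < costs.length : "id out of range: " + id;
    return costs[id];
  }

  /** Explicit check: always enforced, and fails with a descriptive message. */
  public static int lookupChecked(int[] costs, int id) {
    Objects.requireNonNull(costs, "costs");
    if (id < 0 || id >= costs.length) {
      throw new IllegalArgumentException(
          "id out of range: " + id + " (size=" + costs.length + ")");
    }
    return costs[id];
  }

  public static void main(String[] args) {
    int[] costs = {10, 20, 30};
    System.out.println(lookupChecked(costs, 1));
    try {
      lookupChecked(costs, 5);
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

The explicit variant matters most when the data being checked comes from outside the jar, for example a dictionary supplied by the user.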

> Decouple Kuromoji's morphological analyser and its dictionary
> -------------------------------------------------------------
>                 Key: LUCENE-8816
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Tomoko Uchida
>            Priority: Major
> I was inspired by this mailing-list thread.
>  []
> As many Japanese users already know, the default built-in dictionary bundled with Kuromoji
> (MeCab IPADIC) is a bit old and has not been maintained for many years. While it has slowly
> become obsolete, well-maintained and/or extended dictionaries have emerged in recent years
> (e.g. mecab-ipadic-neologd, UniDic). Several attempts/projects/efforts have been made in
> Japan to use them with Kuromoji.
> However, the current architecture (a dictionary bundled in the jar) is essentially incompatible
> with the idea of "switching the system dictionary", and developers have difficulty doing so.
> Traditionally, the morphological analysis engine (Viterbi logic) and the encoded dictionary
> (language model) have been decoupled (as in MeCab, the origin of Kuromoji, or lucene-gosen).
> So decoupling them is a natural idea, and I feel that it is a good time to rethink
> the current architecture.
> Also this would be good for advanced users who have customized/re-trained their own system
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself from the encoded system dictionary.
>  * Implement a dynamic dictionary loading mechanism.
>  * Provide a developer-oriented dictionary build tool.
> Non-goals:
>  * Provide a learner or language model (that is up to users and outside the scope).
> I have not dived into the code yet, so I have no idea whether it's easy or difficult at this
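
The decoupling goals quoted above (an engine that does not depend on a bundled dictionary, plus a dictionary chosen at runtime) can be pictured roughly as follows. This is only a sketch under stated assumptions: the SystemDictionary interface, the Tokenizer wrapper, the DictionaryRegistry, and the inMemory helper are hypothetical names invented for illustration and do not correspond to the actual Kuromoji API or to any design agreed on this issue.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical sketch; none of these types exist in Lucene under these names. */
public class DecoupledDictionarySketch {

  /** What the analysis engine needs from any system dictionary (assumed shape). */
  public interface SystemDictionary {
    String name();

    /** Cost of a surface form, or a fallback cost when the word is unknown. */
    int wordCost(String surface);
  }

  /** The engine depends only on the interface, not on a dictionary bundled in the jar. */
  public static final class Tokenizer {
    private final SystemDictionary dictionary;

    public Tokenizer(SystemDictionary dictionary) {
      if (dictionary == null) {
        throw new IllegalArgumentException("dictionary must not be null");
      }
      this.dictionary = dictionary;
    }

    public String dictionaryName() {
      return dictionary.name();
    }

    public int cost(String surface) {
      return dictionary.wordCost(surface);
    }
  }

  /** Dynamic loading, reduced to its core: dictionaries are looked up by name at runtime. */
  public static final class DictionaryRegistry {
    private final Map<String, Supplier<SystemDictionary>> factories = new HashMap<>();

    public void register(String name, Supplier<SystemDictionary> factory) {
      factories.put(name, factory);
    }

    public SystemDictionary load(String name) {
      Supplier<SystemDictionary> factory = factories.get(name);
      if (factory == null) {
        throw new IllegalArgumentException("unknown dictionary: " + name);
      }
      return factory.get();
    }
  }

  /** Toy in-memory dictionary standing in for an encoded dictionary on disk. */
  public static SystemDictionary inMemory(
      final String name, final Map<String, Integer> costs, final int unknownCost) {
    return new SystemDictionary() {
      @Override
      public String name() {
        return name;
      }

      @Override
      public int wordCost(String surface) {
        Integer cost = costs.get(surface);
        return cost != null ? cost : unknownCost;
      }
    };
  }

  public static void main(String[] args) {
    Map<String, Integer> costs = new HashMap<>();
    costs.put("sushi", 100);

    DictionaryRegistry registry = new DictionaryRegistry();
    registry.register("toy", () -> inMemory("toy", costs, 9999));

    Tokenizer tokenizer = new Tokenizer(registry.load("toy"));
    System.out.println(tokenizer.dictionaryName() + " " + tokenizer.cost("sushi"));
  }
}
```

Passing the dictionary through the constructor keeps the engine unaware of where the data comes from, so swapping IPADIC for another dictionary would be a matter of registering a different factory rather than rebuilding the jar.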
