lucene-dev mailing list archives

From "Namgyu Kim (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary
Date Sat, 01 Jun 2019 11:51:00 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16853678#comment-16853678
] 

Namgyu Kim commented on LUCENE-8816:
------------------------------------

Oh, you're right, [~tomoko]. :D
 I'll open a new JIRA issue after cleaning up the changes.
{quote}To avoid confusion, personally I'd like to proceed things in a right order - cleaning
up first, then generalizing. But if you are sure that we can go in parallel, can you share
your plan?
{quote}
Sure. It's an important thing.
 I think we can proceed in parallel.

There are two possible cases.
 1) You finish your work on JapaneseTokenizer and DictionaryBuilder first.
 In that case, I can pull your new code and merge it with nori's DictionaryBuilder.

2) I finish merging DictionaryBuilder(nori) and DictionaryBuilder(kuromoji) first.
 In that case, you can pull and continue.
 The DictionaryBuilder logic of kuromoji does not change at all in my work.

But if you think this is a little inefficient, I'll do my part later.
 What do you think?

> Decouple Kuromoji's morphological analyser and its dictionary
> -------------------------------------------------------------
>
>                 Key: LUCENE-8816
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8816
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Tomoko Uchida
>            Priority: Major
>
> I was inspired by this mailing-list thread:
>  [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese already know, the default built-in dictionary bundled with Kuromoji (MeCab
> IPADIC) is a bit old and has not been maintained for many years. While it has slowly become obsolete,
> well-maintained and/or extended dictionaries have emerged in recent years (e.g. [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd],
> [UniDic|https://unidic.ninjal.ac.jp/]). Several attempts, projects, and efforts have been made
> in Japan to use them with Kuromoji.
> However, the current architecture (a dictionary bundled in the jar) is essentially incompatible with
> the idea of switching the system dictionary, and developers have difficulty doing so.
> Traditionally, the morphological analysis engine (Viterbi logic) and the encoded dictionary
> (language model) have been decoupled (as in MeCab, the origin of Kuromoji, or lucene-gosen).
> So decoupling them is a natural idea, and I feel that it's a good time to rethink
> the current architecture.
> This would also be good for advanced users who have customized or re-trained their own system
> dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself from the encoded system dictionary.
>  * Implement a dynamic dictionary loading mechanism.
>  * Provide a developer-oriented dictionary build tool.
> Non-goals:
>   * Provide a learner or language model (that is up to users and outside the scope of this issue).
> I have not dived into the code yet, so I have no idea whether this is easy or difficult at this
> moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


