lucene-dev mailing list archives

From "Tomoko Uchida (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary
Date Sat, 01 Jun 2019 00:30:00 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16853504#comment-16853504 ]

Tomoko Uchida edited comment on LUCENE-8816 at 6/1/19 12:29 AM:
----------------------------------------------------------------

{quote}I don't think it would be difficult to merge DictionaryBuilder (except BinaryDictionaryWriter).

But I think the BinaryDictionaryWriter case can be solved if we separate methods (+ use DictionaryFormat).
 Can I try this while you are concentrating on JapaneseTokenizer?
{quote}
Please open another issue for this. Any generalization of DictionaryBuilder and its auxiliary
classes should be treated as another issue, if you agree with my plan.
{quote} - First, decouple the encoded system dictionary (mecab-ipadic) from the kuromoji jar
into a separate jar and clean up the dictionary builder tool. This is the scope of this
issue.
 - Then generalize the dictionary builder tool so that it can handle the Korean dictionary
(mecab-ko-dic), in a separate issue.
 - Lastly, decouple the Korean system dictionary from the nori jar into a separate jar, maybe
in yet another issue.{quote}
Just to make things clear, I will focus on JapaneseTokenizer *and* DictionaryBuilder in the kuromoji
module in this issue, so the DictionaryBuilder (including its auxiliary classes) can be significantly
modified here. To avoid confusion, I'd personally like to proceed in the right order:
cleaning up first, then generalizing. But if you are sure that we can work in parallel, can
you share your plan?



> Decouple Kuromoji's morphological analyser and its dictionary
> -------------------------------------------------------------
>
>                 Key: LUCENE-8816
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8816
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Tomoko Uchida
>            Priority: Major
>
> I was inspired by this mailing list thread:
>  [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese users already know, the default built-in dictionary bundled with Kuromoji (MeCab
IPADIC) is a bit old and has not been maintained for many years. While it has slowly become obsolete,
well-maintained and/or extended dictionaries have emerged in recent years (e.g. [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd],
[UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, several attempts/projects/efforts
have been made in Japan.
> However, the current architecture (a dictionary bundled in the jar) is essentially incompatible with
the idea of switching the system dictionary, and developers have difficulty doing so.
> Traditionally, the morphological analysis engine (viterbi logic) and the encoded dictionary
(language model) have been decoupled (as in MeCab, the origin of Kuromoji, or lucene-gosen).
So decoupling them is a natural idea, and I feel that it's a good time to rethink
the current architecture.
> Also this would be good for advanced users who have customized/re-trained their own system
dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself from the encoded system dictionary.
>  * Implement a dynamic dictionary loading mechanism.
>  * Provide a developer-oriented dictionary build tool.
> Non-goals:
>   * Provide a learner or language model (it's up to users and should be outside the scope).
> I have not dived into the code yet, so I have no idea whether it is easy or difficult at this
moment.
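
The first two goals in the issue description (separating the tokenizer from the encoded dictionary, and resolving a dictionary at runtime) could be sketched roughly as below. This is only an illustration under stated assumptions, not Lucene's or Kuromoji's actual API: `SystemDictionary`, `MapDictionary`, `DictionaryRegistry`, and `ToyTokenizer` are hypothetical names invented for this example, and a real encoded dictionary would be far richer than a map of word costs.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical view of an encoded system dictionary; not a Lucene interface. */
interface SystemDictionary {
    String name();

    /** Returns the cost of a known surface form, or -1 if it is not in the dictionary. */
    int wordCost(String surface);
}

/** Toy in-memory dictionary standing in for an encoded one such as mecab-ipadic. */
final class MapDictionary implements SystemDictionary {
    private final String name;
    private final Map<String, Integer> costs;

    MapDictionary(String name, Map<String, Integer> costs) {
        this.name = name;
        this.costs = new HashMap<>(costs);
    }

    @Override
    public String name() {
        return name;
    }

    @Override
    public int wordCost(String surface) {
        Integer cost = costs.get(surface);
        return cost == null ? -1 : cost;
    }
}

/** Hypothetical registry: a dictionary jar registers a factory, the engine looks it up by name. */
final class DictionaryRegistry {
    private static final Map<String, Supplier<SystemDictionary>> FACTORIES = new HashMap<>();

    private DictionaryRegistry() {}

    static synchronized void register(String name, Supplier<SystemDictionary> factory) {
        FACTORIES.put(name, factory);
    }

    static synchronized SystemDictionary load(String name) {
        Supplier<SystemDictionary> factory = FACTORIES.get(name);
        if (factory == null) {
            throw new IllegalArgumentException("Unknown dictionary: " + name);
        }
        return factory.get();
    }
}

/** Stand-in for the analysis engine; it depends only on the interface, not on a bundled dictionary. */
final class ToyTokenizer {
    private final SystemDictionary dictionary;

    ToyTokenizer(SystemDictionary dictionary) {
        this.dictionary = dictionary;
    }

    String dictionaryName() {
        return dictionary.name();
    }

    boolean knows(String surface) {
        return dictionary.wordCost(surface) >= 0;
    }
}
```

In this sketch a dictionary would be registered once, for example under the name "toy-ipadic", and the tokenizer would then be constructed from `DictionaryRegistry.load("toy-ipadic")`. Switching the system dictionary becomes a matter of loading a different name. The JDK's `java.util.ServiceLoader` is one possible alternative to a hand-written registry for discovering implementations shipped in separate jars.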



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

