lucene-dev mailing list archives

From "Tomoko Uchida (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-8817) Combine Nori and Kuromoji DictionaryBuilder
Date Mon, 10 Jun 2019 12:25:00 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-8817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16859964#comment-16859964 ]

Tomoko Uchida commented on LUCENE-8817:
---------------------------------------

Hi [~cm],

thanks for your comment! I used the term "mecab" without any deep thought in my previous comment.
I respect your perspective and agree that we should not use "mecab" in naming, except
for the code that handles the "MeCab IPADIC" dictionary.

I also like the idea of using "viterbi" for the shared tokenizer code. Meanwhile, "statistical"
sounds a little too general to me for describing the analyzers' functionality. I
thought about using "morphologic" or "morph" in the module name instead of "mecab", but there
is already a "morfologik" module, so it would be confusing...

There is another idea: how about using "kuromoji" in the top-level module name for both the
Japanese and Korean analyzers, and changing the current module names "kuromoji" and "nori" to
"kuromoji-ja" and "kuromoji-ko"? They are just module names for internal use and are not used
in any exposed package, class, or method names (as far as I know). Nor are they used
in user configuration files (as far as I know).

To clarify, my revised proposal would look like this. (I also changed "tools" to "dict-tools"
for clarity.)
{code:java}
analysis
└── kuromoji
         ├── common (module: analyzers-kuromoji-common)
         │       ├── build.xml
         │       └── src
         ├── ja (module: analyzers-kuromoji-ja)
         │       ├── build.xml
         │       └── src
         ├── ko (module: analyzers-kuromoji-ko)
         │       ├── build.xml
         │       └── src
         └── dict-tools  (module: analyzers-kuromoji-dict-tools)
                 ├── build.xml
                 └── src
{code}
This looks natural to me if we pursue the integration of the two analyzers. Does the change
sound too aggressive (especially for users of the Korean analyzer)? I'd love to hear comments from
others. :)

> Combine Nori and Kuromoji DictionaryBuilder
> -------------------------------------------
>
>                 Key: LUCENE-8817
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8817
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Namgyu Kim
>            Priority: Major
>
> This issue is related to LUCENE-8816.
> Currently Nori and Kuromoji Analyzer use the same dictionary structure. (MeCab)
>  If we make combine DictionaryBuilder, we can reduce the code size.
>  But this task may have a dependency on the language.
>  (like HEADER string in BinaryDictionary and CharacterDefinition, methods in BinaryDictionaryWriter, ...)
>  On the other hand, there are many overlapped classes.
> The purpose of this patch is to provide users of Nori and Kuromoji with the same system dictionary generator.
> It may take some time because there is a little workload.
>  The work will be based on the latest master, and if the LUCENE-8816 is finished first, I will pull the latest code and proceed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

