lucene-dev mailing list archives

From "Christian Moen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-8817) Combine Nori and Kuromoji DictionaryBuilder
Date Mon, 10 Jun 2019 02:42:00 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-8817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16859673#comment-16859673 ]

Christian Moen commented on LUCENE-8817:
----------------------------------------

Thanks, [~tomoko].  I don't think we should use any "mecab" in the naming.  Please let me elaborate
a bit.

Kuromoji can read MeCab format models, but Kuromoji isn't a port of MeCab.  Kuromoji has
been developed independently without inspecting or reviewing any MeCab source code.  This
was an initial goal of the project to make sure we could use an Apache License.

The MeCab and Kuromoji feature sets are quite different and I think users will find it confusing
if they expect MeCab and find that Kuromoji is much more limited.

I'm also unsure if Kudo-san will appreciate that we make an association by name like this. 
It certainly doesn't give due credit to MeCab, in my opinion, which is a much more extensive
project.

In terms of naming, what about using "statistical" instead of "mecab" for this class of analyzers?

I'm thinking "Viterbi" could be good to refer to in shared tokenizer code.

That said, I think it could be good to refer to "mecab" in the dictionary compiler code,
documentation, etc. to make sure users understand that we can read this model format.

Any thoughts?

> Combine Nori and Kuromoji DictionaryBuilder
> -------------------------------------------
>
>                 Key: LUCENE-8817
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8817
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Namgyu Kim
>            Priority: Major
>
> This issue is related to LUCENE-8816.
> Currently Nori and Kuromoji Analyzer use the same dictionary structure. (MeCab)
>  If we combine the DictionaryBuilders, we can reduce the code size.
>  But this task may have a dependency on the language.
>  (like HEADER string in BinaryDictionary and CharacterDefinition, methods in BinaryDictionaryWriter, ...)
>  On the other hand, there are many overlapped classes.
> The purpose of this patch is to provide users of Nori and Kuromoji with the same system dictionary generator.
> It may take some time because there is a little workload.
>  The work will be based on the latest master, and if LUCENE-8816 is finished first, I will pull the latest code and proceed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

