lucene-dev mailing list archives

From "Namgyu Kim (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-8817) Combine Nori and Kuromoji DictionaryBuilder
Date Mon, 10 Jun 2019 16:23:00 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-8817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16860128#comment-16860128 ]

Namgyu Kim commented on LUCENE-8817:
------------------------------------

Thank you for your replies, [~tomoko] and [~cm] :D

I was impressed by how deeply you have thought about this.
{code:java}
analysis
└── ???
         ├── common (module: analyzers-???-common)
         │       ├── build.xml
         │       └── src
         ├── kuromoji (module: analyzers-???-kuromoji)
         │       ├── build.xml
         │       └── src
         ├── nori (module: analyzers-???-nori)
         │       ├── build.xml
         │       └── src
         └── tools  (module: analyzers-???-tools)
                 ├── build.xml
                 └── src
{code}
I agree with the module structure proposed by Tomoko.
 Personally, I think "analysis" is a better name than "analyzers".
{quote}In terms of naming, what about using "statistical" instead of "mecab" for this class
of analyzers?
 I'm thinking "Viterbi" could be good to refer to in shared tokenizer code.
 This said, I think it could be a good to refer to "mecab" in the dictionary compiler code,
documentation, etc. to make sure users understand that we can read this model format.
 Any thoughts?
{quote}
As for the name, the folder name "viterbi" looks much better than "statistical".
 But to be perfectly honest, I'm not sure that using an algorithm name as the folder name is really the right choice.
 Most users probably don't know what Viterbi is.
 The folder name also carries over into the package name, and "org.apache.lucene.analysis.viterbi.ja" or "~.viterbi.ko" would confuse users.
 Alternatively, we could just use "org.apache.lucene.analysis.ja", which would be fine,
 because analysis-common already does the same thing
 (its package is org.apache.lucene.analysis.cjk, not org.apache.lucene.common.cjk).
 Using the name purely for administrative purposes would be fine, but I'd also like to hear what others think.
{quote}how about using "kuromoji" in the top level module name for both of Japanese and Korean
analyzers, and changing current module names "kuromoji" and "nori" to "kuromoji-ja" and "kuromoji-ko"?
{quote}
Personally, I don't agree with using "kuromoji-ko" instead of "nori".
"nori" is already a name familiar to users,
and renaming it may confuse them.

> Combine Nori and Kuromoji DictionaryBuilder
> -------------------------------------------
>
>                 Key: LUCENE-8817
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8817
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Namgyu Kim
>            Priority: Major
>
> This issue is related to LUCENE-8816.
> Currently, the Nori and Kuromoji analyzers use the same dictionary structure (MeCab).
>  If we combine their DictionaryBuilders, we can reduce the code size.
>  However, this task may involve language-specific dependencies
>  (such as the HEADER string in BinaryDictionary and CharacterDefinition, methods in BinaryDictionaryWriter, ...).
>  On the other hand, many classes overlap.
> The purpose of this patch is to provide users of Nori and Kuromoji with the same system dictionary generator.
> It may take some time because some work is involved.
>  The work will be based on the latest master, and if LUCENE-8816 is finished first, I will pull the latest code and proceed.




