lucene-dev mailing list archives

From "Christian Moen (Commented) (JIRA)" <>
Subject [jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer
Date Mon, 14 Nov 2011 14:24:51 GMT


Christian Moen commented on LUCENE-3305:

Thanks a lot, Simon!

Robert, I agree completely with your comments.  The Unicode normalization is only done at
dictionary build time.  Simon has turned it on by default -- its previous default was off.
 Perhaps it makes sense to have it on in Lucene's case...
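
As a rough illustration of what build-time normalization of dictionary entries could look like, here is a minimal sketch using the JDK's java.text.Normalizer.  The normalization form (NFKC) and the helper name normalizeEntry are assumptions made for this example and are not taken from the patch.

```java
import java.text.Normalizer;

public class NormalizeSketch {
    // Hypothetical helper: normalize one dictionary surface form, once, at build time.
    // NFKC is assumed here; the form used by the actual dictionary builder may differ.
    static String normalizeEntry(String surface) {
        return Normalizer.normalize(surface, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Half-width katakana folds to full-width katakana.
        System.out.println(normalizeEntry("ｶﾞｲﾄﾞ"));      // prints ガイド
        // Full-width Latin letters fold to ASCII.
        System.out.println(normalizeEntry("Ｌｕｃｅｎｅ"));  // prints Lucene
    }
}
```

Since this step runs only when the dictionary is built, text passed to the tokenizer at analysis time is not normalized by it.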

Simon, the TokenizerRunner class doesn't seem to be included in the patch, which might be
fine.  It's not strictly necessary for Lucene, but I think it's useful to keep it there so
the analyzer can easily be run from the command line.  The DebugTokenizer and GraphvizFormatter
are there already; they aren't strictly necessary either, but are sometimes quite useful, so I
think we should add the TokenizerRunner as well -- at least for now.

Tests didn't pass in my case, but I'll look more into this soon.  Tomorrow is very busy for me,
but I'll have time for this on Wednesday.

> Kuromoji code donation - a new Japanese morphological analyzer
> --------------------------------------------------------------
>                 Key: LUCENE-3305
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Christian Moen
>            Assignee: Simon Willnauer
>             Fix For: 4.0
>         Attachments: Kuromoji short overview .pdf, LUCENE-3305.patch, ip-clearance-Kuromoji.xml,
ip-clearance-Kuromoji.xml, kuromoji-0.7.6-asf.tar.gz, kuromoji-0.7.6.tar.gz, kuromoji-solr-0.5.3-asf.tar.gz,
> Atilika Inc. (アティリカ株式会社) would like to donate the Kuromoji Japanese
morphological analyzer to the Apache Software Foundation in the hope that it will be useful
to Lucene and Solr users in Japan and elsewhere.
> The project was started in 2010 since we couldn't find any high-quality, actively maintained
and easy-to-use Java-based Japanese morphological analyzers, and these qualities became many of
our design goals for Kuromoji.
> Kuromoji also has a segmentation mode that is particularly useful for search, which we
hope will interest Lucene and Solr users.  Compound nouns, such as 関西国際空港 (Kansai
International Airport) and 日本経済新聞 (Nikkei Newspaper), are segmented as one token
by most analyzers.  As a result, a search for 空港 (airport) or 新聞 (newspaper) will
not give you a hit for these words.  Kuromoji can segment these words into 関西 国際 空港
and 日本 経済 新聞, which is generally what you would want for search, and you'll get
a hit.
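
A minimal sketch of why the segmentation matters for search, assuming a simplified model in which a query hits only when it equals one of the indexed tokens.  The token lists are written out by hand from the example above; no analyzer is invoked, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class SegmentationSketch {
    // Simplified term-level match: a query hits only if it equals an indexed token.
    static boolean hits(List<String> indexedTokens, String query) {
        return indexedTokens.contains(query);
    }

    public static void main(String[] args) {
        // The compound noun kept as a single token.
        List<String> oneToken = Arrays.asList("関西国際空港");
        // The same compound noun split into its parts, as in the example above.
        List<String> segmented = Arrays.asList("関西", "国際", "空港");

        System.out.println(hits(oneToken, "空港"));   // prints false
        System.out.println(hits(segmented, "空港"));  // prints true
    }
}
```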
> We also wanted to make sure the technology has a license that makes it compatible with
other Apache Software Foundation software to maximize its usefulness.  Kuromoji has an Apache
License 2.0 and all code is currently owned by Atilika Inc.  The software has been developed
by my good friend and ex-colleague Masaru Hasegawa and myself.
> Kuromoji uses the so-called IPADIC for its dictionary/statistical model and its license
terms are described in NOTICE.txt.
> I'll upload code distributions and their corresponding hashes and I'd very much like
to start the code grant process.  I'm also happy to provide patches to integrate Kuromoji
into the codebase, if you prefer that.
> Please advise on how you'd like me to proceed with this.  Thank you.

This message is automatically generated by JIRA.