lucene-dev mailing list archives

From "Kazuaki Hiraga (JIRA)"
Subject [jira] [Created] (LUCENE-4056) Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary
Date Mon, 14 May 2012 06:21:48 GMT
Kazuaki Hiraga created LUCENE-4056:

             Summary: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary
                 Key: LUCENE-4056
             Project: Lucene - Java
          Issue Type: Improvement
          Components: modules/analysis
    Affects Versions: 3.6
         Environment: Solr 3.6
UniDic 1.3.12 for MeCab (unidic-mecab1312src.tar.gz)
            Reporter: Kazuaki Hiraga

I tried to build a UniDic dictionary to use with Kuromoji on Solr 3.6. I think
UniDic is a better dictionary than the IPA dictionary, so Kuromoji for Lucene/Solr should
support the UniDic dictionary, as standalone Kuromoji does.

The following is my procedure:
I modified build.xml under the lucene/contrib/analyzers/kuromoji directory and ran 'ant build-dict',
which failed with the error below.

     [java] dictionary builder
     [java] dictionary format: UNIDIC
     [java] input directory: /home/kazu/Work/src/solr/brunch_3_6/lucene/build/contrib/analyzers/kuromoji/unidic-mecab1312src
     [java] output directory: /home/kazu/Work/src/solr/brunch_3_6/lucene/contrib/analyzers/kuromoji/src/resources
     [java] input encoding: utf-8
     [java] normalize entries: false
     [java] building tokeninfo dict...
     [java]   parse...
     [java]   sort...
     [java] Exception in thread "main" java.lang.AssertionError
     [java]   encode...
     [java] 	at org.apache.lucene.analysis.ja.util.BinaryDictionaryWriter.put(
     [java] 	at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.buildDictionary(
     [java] 	at
     [java] 	at
     [java] 	at org.apache.lucene.analysis.ja.util.DictionaryBuilder.main(

Here is the diff of build.xml:

--- build.xml	(revision 1338023)
+++ build.xml	(working copy)
@@ -28,19 +28,31 @@
   <property name="maven.dist.dir" location="../../../dist/maven" />
   <!-- default configuration: uses mecab-ipadic -->
+  <!--
   <property name="ipadic.version" value="mecab-ipadic-2.7.0-20070801" />
   <property name="dict.src.file" value="${ipadic.version}.tar.gz" />
   <property name="dict.url" value="${dict.src.file}"/>
+  -->
   <!-- alternative configuration: uses mecab-naist-jdic
   <property name="ipadic.version" value="mecab-naist-jdic-0.6.3b-20111013" />
   <property name="dict.src.file" value="${ipadic.version}.tar.gz" />
   <property name="dict.url" value=";f=/naist-jdic/53500/${dict.src.file}"/>
+  <!-- alternative configuration: uses UniDic -->
+  <property name="ipadic.version" value="unidic-mecab1312src" />
+  <property name="dict.src.file" value="unidic-mecab1312src.tar.gz" />
+  <property name="dict.loc.dir" value="/home/kazu/Work/src/nlp/unidic/_archive"/>
   <property name="dict.src.dir" value="${build.dir}/${ipadic.version}" />
+  <!--
   <property name="dict.encoding" value="euc-jp"/>
   <property name="dict.format" value="ipadic"/>
+  -->
+  <property name="dict.encoding" value="utf-8"/>
+  <property name="dict.format" value="unidic"/>
   <property name="dict.normalize" value="false"/>
   <property name="" location="./src/resources"/>
@@ -58,7 +70,8 @@
   <target name="compile-core" depends="jar-analyzers-common, common.compile-core" />
   <target name="download-dict" unless="dict.available">
-     <get src="${dict.url}" dest="${build.dir}/${dict.src.file}"/>
+     <!-- get src="${dict.url}" dest="${build.dir}/${dict.src.file}"/ -->
+     <copy file="${dict.loc.dir}/${dict.src.file}" tofile="${build.dir}/${dict.src.file}"/>
      <gunzip src="${build.dir}/${dict.src.file}"/>
      <untar src="${build.dir}/${ipadic.version}.tar" dest="${build.dir}"/>
