lucene-dev mailing list archives

From "Kazuaki Hiraga (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-4056) Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary
Date Wed, 16 May 2012 04:00:19 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-4056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13276429#comment-13276429
] 

Kazuaki Hiraga commented on LUCENE-4056:
----------------------------------------

Hi Christian,

Thank you for your comment.

I understand the situation. I wasn't expecting UniDic to be bundled and shipped with Kuromoji.
For the time being, I just want to build it and use it with Kuromoji for Lucene/Solr.

We have just started evaluating UniDic and are at a very early stage, so we have not concluded
that we have to, or need to, use UniDic instead of the IPA dictionary. Although we haven't finished
our evaluation, I like UniDic's concept and policy of strictly defining how
tokens are specified, and I am satisfied with the tokenization results. I think it is better
than the IPA dictionary for Katakana segmentation and compound segmentation.

On the other hand, I understand there is a license issue that we will have to resolve if we decide
to use it in our internal services. Thanks for reminding me.

Thanks.
                
> Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary
> ------------------------------------------------------------
>
>                 Key: LUCENE-4056
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4056
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 3.6
>         Environment: Solr 3.6
> UniDic 1.3.12 for MeCab (unidic-mecab1312src.tar.gz)
>            Reporter: Kazuaki Hiraga
>
> I tried to build a UniDic dictionary to use with Kuromoji on Solr 3.6. I
> think UniDic is a better dictionary than the IPA dictionary, so Kuromoji for Lucene/Solr should
> support the UniDic dictionary as standalone Kuromoji does.
> The following is my procedure:
> I modified build.xml under the lucene/contrib/analyzers/kuromoji directory and ran 'ant build-dict',
> which produced the error below.
> build-dict:
>      [java] dictionary builder
>      [java] 
>      [java] dictionary format: UNIDIC
>      [java] input directory: /home/kazu/Work/src/solr/brunch_3_6/lucene/build/contrib/analyzers/kuromoji/unidic-mecab1312src
>      [java] output directory: /home/kazu/Work/src/solr/brunch_3_6/lucene/contrib/analyzers/kuromoji/src/resources
>      [java] input encoding: utf-8
>      [java] normalize entries: false
>      [java] 
>      [java] building tokeninfo dict...
>      [java]   parse...
>      [java]   sort...
>      [java] Exception in thread "main" java.lang.AssertionError
>      [java]   encode...
>      [java] 	at org.apache.lucene.analysis.ja.util.BinaryDictionaryWriter.put(BinaryDictionaryWriter.java:113)
>      [java] 	at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.buildDictionary(TokenInfoDictionaryBuilder.java:141)
>      [java] 	at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.build(TokenInfoDictionaryBuilder.java:76)
>      [java] 	at org.apache.lucene.analysis.ja.util.DictionaryBuilder.build(DictionaryBuilder.java:37)
>      [java] 	at org.apache.lucene.analysis.ja.util.DictionaryBuilder.main(DictionaryBuilder.java:82)
> And the diff of build.xml:
> ===================================================================
> --- build.xml	(revision 1338023)
> +++ build.xml	(working copy)
> @@ -28,19 +28,31 @@
>    <property name="maven.dist.dir" location="../../../dist/maven" />
>  
>    <!-- default configuration: uses mecab-ipadic -->
> +  <!--
>    <property name="ipadic.version" value="mecab-ipadic-2.7.0-20070801" />
>    <property name="dict.src.file" value="${ipadic.version}.tar.gz" />
>    <property name="dict.url" value="http://mecab.googlecode.com/files/${dict.src.file}"/>
> +  -->
>  
>    <!-- alternative configuration: uses mecab-naist-jdic
>    <property name="ipadic.version" value="mecab-naist-jdic-0.6.3b-20111013" />
>    <property name="dict.src.file" value="${ipadic.version}.tar.gz" />
>    <property name="dict.url" value="http://sourceforge.jp/frs/redir.php?m=iij&amp;f=/naist-jdic/53500/${dict.src.file}"/>
>    -->
> -  
> +
> +  <!-- alternative configuration: uses UniDic -->
> +  <property name="ipadic.version" value="unidic-mecab1312src" />
> +  <property name="dict.src.file" value="unidic-mecab1312src.tar.gz" />
> +  <property name="dict.loc.dir" value="/home/kazu/Work/src/nlp/unidic/_archive"/>
> +
>    <property name="dict.src.dir" value="${build.dir}/${ipadic.version}" />
> +  <!--
>    <property name="dict.encoding" value="euc-jp"/>
>    <property name="dict.format" value="ipadic"/>
> +  -->
> +  <property name="dict.encoding" value="utf-8"/>
> +  <property name="dict.format" value="unidic"/>
> +
>    <property name="dict.normalize" value="false"/>
>    <property name="dict.target.dir" location="./src/resources"/>
>  
> @@ -58,7 +70,8 @@
>  
>    <target name="compile-core" depends="jar-analyzers-common, common.compile-core" />
>    <target name="download-dict" unless="dict.available">
> -     <get src="${dict.url}" dest="${build.dir}/${dict.src.file}"/>
> +     <!-- get src="${dict.url}" dest="${build.dir}/${dict.src.file}"/ -->
> +     <copy file="${dict.loc.dir}/${dict.src.file}" tofile="${build.dir}/${dict.src.file}"/>
>       <gunzip src="${build.dir}/${dict.src.file}"/>
>       <untar src="${build.dir}/${ipadic.version}.tar" dest="${build.dir}"/>
>    </target>
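
The trace quoted above shows only that an assertion failed in BinaryDictionaryWriter.put; it does not say which condition was violated. As a possible first diagnostic, the sketch below scans dictionary CSV files and summarizes the left and right context ID columns, which could then be compared against whatever the builder asserts. It assumes the usual MeCab CSV layout (surface, left ID, right ID, cost, then features) and uses a naive comma split. The class name MecabCsvIdCheck and its fields are hypothetical and are not part of Lucene or Kuromoji.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical diagnostic helper; not part of Lucene or Kuromoji.
public class MecabCsvIdCheck {

    /** Aggregate statistics over the ID columns of MeCab-style CSV entries. */
    static final class Stats {
        int entries;     // lines whose ID columns parsed successfully
        int skipped;     // lines that were too short or had non-numeric IDs
        int maxLeftId;   // largest left context ID seen
        int maxRightId;  // largest right context ID seen
        int mismatched;  // entries where left ID differs from right ID
    }

    static Stats summarize(Iterable<String> lines) {
        Stats s = new Stats();
        for (String line : lines) {
            // Assumption: surface forms contain no quoted commas.
            // Lines that break this assumption are likely to be counted as skipped.
            String[] fields = line.split(",");
            if (fields.length < 4) {
                s.skipped++;
                continue;
            }
            int left;
            int right;
            try {
                left = Integer.parseInt(fields[1].trim());
                right = Integer.parseInt(fields[2].trim());
            } catch (NumberFormatException e) {
                s.skipped++;
                continue;
            }
            s.entries++;
            s.maxLeftId = Math.max(s.maxLeftId, left);
            s.maxRightId = Math.max(s.maxRightId, right);
            if (left != right) {
                s.mismatched++;
            }
        }
        return s;
    }

    public static void main(String[] args) throws IOException {
        // Each argument is the path to one UTF-8 encoded CSV file.
        for (String path : args) {
            List<String> lines = Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8);
            Stats s = summarize(lines);
            System.out.printf(
                "%s: entries=%d skipped=%d maxLeftId=%d maxRightId=%d leftRightMismatch=%d%n",
                path, s.entries, s.skipped, s.maxLeftId, s.maxRightId, s.mismatched);
        }
    }
}
```

It could be run with one or more of the extracted .csv files as arguments. A large ID range or a nonzero mismatch count would only be a hint about where to look, not a confirmed cause of the failure.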

