lucene-dev mailing list archives

From "Robert Muir (Created) (JIRA)" <j...@apache.org>
Subject [jira] [Created] (LUCENE-3696) building a kuromoji dictionary is very slow and eventually fails if you use java 5
Date Sat, 14 Jan 2012 17:45:40 GMT
building a kuromoji dictionary is very slow and eventually fails if you use java 5
----------------------------------------------------------------------------------

                 Key: LUCENE-3696
                 URL: https://issues.apache.org/jira/browse/LUCENE-3696
             Project: Lucene - Java
          Issue Type: Bug
    Affects Versions: 3.6
            Reporter: Robert Muir


Note: This only affects you if you use Java 5 on 3.x and want to download or rebuild the dictionary.
The analyzer itself works fine on 3.x with Java 5.

With Java 6, building a kuromoji dictionary is quite fast:
{noformat}
     [java] building tokeninfo dict...
     [java]   parse...
     [java]   sort...
     [java]   encode...
     [java]   53645 nodes, 253185 arcs, 1954817 bytes...   done
     [java] done
     [java] building unknown word dict...done
     [java] building connection costs...done

BUILD SUCCESSFUL
Total time: 6 seconds
{noformat}

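However, with Java 5 the build is very slow and eventually runs out of memory in the CSV
parsing phase (the run below fails after about two minutes).
The stack trace shows String.replaceAll compiling a new regex Pattern inside
CSVUtil.unQuoteUnEscape, so we might need to optimize the CSV parser, for example by
precompiling its patterns.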

{noformat}
     [java] building tokeninfo dict...
     [java]   parse...
     [java] Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
     [java] 	at java.util.regex.Pattern.newSlice(Pattern.java:2909)
     [java] 	at java.util.regex.Pattern.atom(Pattern.java:1898)
     [java] 	at java.util.regex.Pattern.sequence(Pattern.java:1794)
     [java] 	at java.util.regex.Pattern.expr(Pattern.java:1687)
     [java] 	at java.util.regex.Pattern.compile(Pattern.java:1397)
     [java] 	at java.util.regex.Pattern.<init>(Pattern.java:1124)
     [java] 	at java.util.regex.Pattern.compile(Pattern.java:817)
     [java] 	at java.lang.String.replaceAll(String.java:2000)
     [java] 	at org.apache.lucene.analysis.kuromoji.util.CSVUtil.unQuoteUnEscape(CSVUtil.java:84)
     [java] 	at org.apache.lucene.analysis.kuromoji.util.CSVUtil.parse(CSVUtil.java:55)
     [java] 	at org.apache.lucene.analysis.kuromoji.util.TokenInfoDictionaryBuilder.buildDictionary(TokenInfoDictionaryBuilder.java:96)
     [java] 	at org.apache.lucene.analysis.kuromoji.util.TokenInfoDictionaryBuilder.build(TokenInfoDictionaryBuilder.java:76)
     [java] 	at org.apache.lucene.analysis.kuromoji.util.DictionaryBuilder.build(DictionaryBuilder.java:37)
     [java] 	at org.apache.lucene.analysis.kuromoji.util.DictionaryBuilder.main(DictionaryBuilder.java:82)

BUILD FAILED
/home/rmuir/workspace/lucene-branch3x2/lucene/contrib/analyzers/kuromoji/build.xml:75: Java returned: 1

Total time: 2 minutes 4 seconds
{noformat}
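
As a rough illustration of the "precompile its patterns" idea, here is a minimal sketch. It is not the actual CSVUtil source: the class name PrecompiledCsvUnquote, the field name ESCAPED_QUOTE, and the assumption that fields are wrapped in double quotes with embedded quotes doubled are all hypothetical. The point is only that the Pattern is compiled once, rather than on every String.replaceAll call as in the stack trace above. Whether this alone resolves the Java 5 OutOfMemoryError is untested.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not the real CSVUtil. */
final class PrecompiledCsvUnquote {
    // Compiled once at class-load time. String.replaceAll would instead
    // call Pattern.compile on every invocation.
    private static final Pattern ESCAPED_QUOTE = Pattern.compile("\"\"");

    private PrecompiledCsvUnquote() {}

    /**
     * Strips one pair of surrounding double quotes (if present) and
     * collapses each doubled quote to a single quote. The quoting
     * convention is an assumption, not taken from the issue.
     */
    static String unQuoteUnEscape(String field) {
        String result = field;
        if (result.length() >= 2 && result.startsWith("\"") && result.endsWith("\"")) {
            result = result.substring(1, result.length() - 1);
        }
        // Reuses the compiled pattern; only a Matcher is created per call.
        return ESCAPED_QUOTE.matcher(result).replaceAll("\"");
    }
}
```

Calling PrecompiledCsvUnquote.unQuoteUnEscape once per CSV field would then avoid recompiling the regex for every field of every dictionary entry.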


        
