lucene-dev mailing list archives

From "Robert Muir (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3696) building a kuromoji dictionary is very slow and eventually fails if you use java 5
Date Sat, 14 Jan 2012 17:51:39 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186282#comment-13186282 ]

Robert Muir commented on LUCENE-3696:
-------------------------------------

With the patch:
{noformat}
     [java] building tokeninfo dict...
     [java]   parse...
     [java]   sort...
     [java]   encode...
     [java]   53645 nodes, 253185 arcs, 1954817 bytes...   done
     [java] done
     [java] building unknown word dict...done
     [java] building connection costs...done

BUILD SUCCESSFUL
Total time: 10 seconds
{noformat}
                
> building a kuromoji dictionary is very slow and eventually fails if you use java 5
> ----------------------------------------------------------------------------------
>
>                 Key: LUCENE-3696
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3696
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.6
>            Reporter: Robert Muir
>         Attachments: LUCENE-3696.patch
>
>
> Note: This only affects you if you use Java 5 on 3.x, and only if you want to download/rebuild the dictionary.
> The analyzer itself works fine on 3.x with Java 5.
> With Java 6, building a kuromoji dictionary is quite fast:
> {noformat}
>      [java] building tokeninfo dict...
>      [java]   parse...
>      [java]   sort...
>      [java]   encode...
>      [java]   53645 nodes, 253185 arcs, 1954817 bytes...   done
>      [java] done
>      [java] building unknown word dict...done
>      [java] building connection costs...done
> BUILD SUCCESSFUL
> Total time: 6 seconds
> {noformat}
> However, if you use Java 5, the build is very slow and eventually runs out of memory in the CSV parsing phase (the run below fails after about two minutes).
> So we might need to optimize the CSV parser (for example, by precompiling its patterns).
> {noformat}
>      [java] building tokeninfo dict...
>      [java]   parse...
>      [java] Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>      [java] 	at java.util.regex.Pattern.newSlice(Pattern.java:2909)
>      [java] 	at java.util.regex.Pattern.atom(Pattern.java:1898)
>      [java] 	at java.util.regex.Pattern.sequence(Pattern.java:1794)
>      [java] 	at java.util.regex.Pattern.expr(Pattern.java:1687)
>      [java] 	at java.util.regex.Pattern.compile(Pattern.java:1397)
>      [java] 	at java.util.regex.Pattern.<init>(Pattern.java:1124)
>      [java] 	at java.util.regex.Pattern.compile(Pattern.java:817)
>      [java] 	at java.lang.String.replaceAll(String.java:2000)
>      [java] 	at org.apache.lucene.analysis.kuromoji.util.CSVUtil.unQuoteUnEscape(CSVUtil.java:84)
>      [java] 	at org.apache.lucene.analysis.kuromoji.util.CSVUtil.parse(CSVUtil.java:55)
>      [java] 	at org.apache.lucene.analysis.kuromoji.util.TokenInfoDictionaryBuilder.buildDictionary(TokenInfoDictionaryBuilder.java:96)
>      [java] 	at org.apache.lucene.analysis.kuromoji.util.TokenInfoDictionaryBuilder.build(TokenInfoDictionaryBuilder.java:76)
>      [java] 	at org.apache.lucene.analysis.kuromoji.util.DictionaryBuilder.build(DictionaryBuilder.java:37)
>      [java] 	at org.apache.lucene.analysis.kuromoji.util.DictionaryBuilder.main(DictionaryBuilder.java:82)
> BUILD FAILED
> /home/rmuir/workspace/lucene-branch3x2/lucene/contrib/analyzers/kuromoji/build.xml:75: Java returned: 1
> Total time: 2 minutes 4 seconds
> {noformat}
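
The trace above shows String.replaceAll calling Pattern.compile from CSVUtil.unQuoteUnEscape, so the regex is compiled again on every call. Below is a minimal sketch of the "precompile its patterns" idea. The class name, method names and regexes are hypothetical, chosen only for illustration; they are not taken from CSVUtil or from the attached patch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PrecompiledUnquote {
    // Hypothetical patterns, compiled once when the class is loaded.
    private static final Pattern SURROUNDING_QUOTES = Pattern.compile("^\"(.*)\"$");
    private static final Pattern ESCAPED_QUOTE = Pattern.compile("\"\"");

    // Slow variant: each String.replaceAll call compiles its regex again.
    public static String unquoteSlow(String s) {
        return s.replaceAll("^\"(.*)\"$", "$1").replaceAll("\"\"", "\"");
    }

    // Fast variant: reuses the precompiled patterns and skips regex work
    // entirely when the field contains no quote character.
    public static String unquoteFast(String s) {
        if (s.indexOf('"') < 0) {
            return s;
        }
        Matcher m = SURROUNDING_QUOTES.matcher(s);
        String result = m.matches() ? m.group(1) : s;
        return ESCAPED_QUOTE.matcher(result).replaceAll("\"");
    }

    public static void main(String[] args) {
        String field = "\"a\"\"b\"";
        // Both variants should agree on this input.
        System.out.println(unquoteSlow(field).equals(unquoteFast(field)));
        System.out.println(unquoteFast(field));
    }
}
```

Pattern instances are immutable and safe to share, so holding them in static final fields is the usual way to avoid recompiling per call. Whether this alone explains the Java 5 slowdown is not something the sketch demonstrates.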

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

