Date: Sat, 14 Jan 2012 17:45:40 +0000 (UTC)
From: "Robert Muir (Created) (JIRA)"
To: dev@lucene.apache.org
Message-ID: <1260461322.41374.1326563140452.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Created] (LUCENE-3696) building a kuromoji dictionary is very slow and eventually fails if you use java 5

building a kuromoji dictionary is very slow and eventually fails if you use java 5
----------------------------------------------------------------------------------

Key: LUCENE-3696
URL: https://issues.apache.org/jira/browse/LUCENE-3696
Project: Lucene - Java
Issue Type: Bug
Affects Versions: 3.6
Reporter: Robert Muir


Note: This only affects you if you use Java 5 on 3.x and want to download/rebuild the dictionary. The analyzer itself works fine on 3.x with Java 5.

With Java 6, building a kuromoji dictionary is quite fast:
{noformat}
     [java] building tokeninfo dict...
     [java]   parse...
     [java]   sort...
     [java]   encode...
     [java]   53645 nodes, 253185 arcs, 1954817 bytes...   done
     [java] done
     [java] building unknown word dict...done
     [java] building connection costs...done

BUILD SUCCESSFUL
Total time: 6 seconds
{noformat}

However, if you use Java 5, it is very slow (over 2 minutes in the run below, versus 6 seconds above) and eventually runs out of memory in the CSV parsing phase. So we might need to optimize the CSV parser (for example, by precompiling its patterns).

{noformat}
     [java] building tokeninfo dict...
     [java]   parse...
     [java] Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
     [java]     at java.util.regex.Pattern.newSlice(Pattern.java:2909)
     [java]     at java.util.regex.Pattern.atom(Pattern.java:1898)
     [java]     at java.util.regex.Pattern.sequence(Pattern.java:1794)
     [java]     at java.util.regex.Pattern.expr(Pattern.java:1687)
     [java]     at java.util.regex.Pattern.compile(Pattern.java:1397)
     [java]     at java.util.regex.Pattern.<init>(Pattern.java:1124)
     [java]     at java.util.regex.Pattern.compile(Pattern.java:817)
     [java]     at java.lang.String.replaceAll(String.java:2000)
     [java]     at org.apache.lucene.analysis.kuromoji.util.CSVUtil.unQuoteUnEscape(CSVUtil.java:84)
     [java]     at org.apache.lucene.analysis.kuromoji.util.CSVUtil.parse(CSVUtil.java:55)
     [java]     at org.apache.lucene.analysis.kuromoji.util.TokenInfoDictionaryBuilder.buildDictionary(TokenInfoDictionaryBuilder.java:96)
     [java]     at org.apache.lucene.analysis.kuromoji.util.TokenInfoDictionaryBuilder.build(TokenInfoDictionaryBuilder.java:76)
     [java]     at org.apache.lucene.analysis.kuromoji.util.DictionaryBuilder.build(DictionaryBuilder.java:37)
     [java]     at org.apache.lucene.analysis.kuromoji.util.DictionaryBuilder.main(DictionaryBuilder.java:82)

BUILD FAILED
/home/rmuir/workspace/lucene-branch3x2/lucene/contrib/analyzers/kuromoji/build.xml:75: Java returned: 1

Total time: 2 minutes 4 seconds
{noformat}
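
As a rough illustration of the "precompile its patterns" idea in the report: the stack trace shows String.replaceAll calling Pattern.compile, which happens on every invocation, so a parser that calls it once per CSV field recompiles the same regex over and over. The sketch below hoists the compilation into static final fields. It is only a sketch under assumptions: the class name PrecompiledPatternSketch, the method unquote, and both regexes are hypothetical and are not taken from the actual CSVUtil source.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a CSV unquote helper; NOT the real CSVUtil code.
final class PrecompiledPatternSketch {

    // Compiled once at class-load time. Pattern instances are immutable,
    // so they can be shared freely across calls and threads.
    private static final Pattern SURROUNDING_QUOTES = Pattern.compile("^\"(.*)\"$");
    private static final Pattern ESCAPED_QUOTE = Pattern.compile("\"\"");

    // Strips one pair of surrounding quotes and turns "" into ".
    // The regexes are assumptions chosen for illustration.
    public static String unquote(String field) {
        if (field.indexOf('"') < 0) {
            return field; // fast path: no quotes, no regex work at all
        }
        String result = field;
        Matcher m = SURROUNDING_QUOTES.matcher(result);
        if (m.matches()) {
            result = m.group(1);
        }
        // Reuses the compiled pattern instead of String.replaceAll,
        // which would call Pattern.compile again on each invocation.
        return ESCAPED_QUOTE.matcher(result).replaceAll("\"");
    }

    public static void main(String[] args) {
        System.out.println(unquote("plain"));                       // prints: plain
        System.out.println(unquote("\"a \"\"quoted\"\" word\""));   // prints: a "quoted" word
    }
}
```

The early return for fields without any quote character is a separate, cheaper optimization: most dictionary fields would presumably skip the regex engine entirely, though that depends on the input data.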