Date: Tue, 27 Mar 2012 16:50:27 +0000 (UTC)
From: "Christian Moen (Issue Comment Edited) (JIRA)"
To: dev@lucene.apache.org
Reply-To: dev@lucene.apache.org
Message-ID: <1455991011.24007.1332867027775.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <1960192461.22739.1332845247887.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze

[
https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239579#comment-13239579 ]

Christian Moen edited comment on SOLR-3282 at 3/27/12 4:49 PM:
---------------------------------------------------------------

h5. Test 1: Indexing Japanese Wikipedia

In this test I'm only indexing documents -- no searching is being done.

I've extracted text pretty accurately from Japanese Wikipedia and removed all the gory markup so the content is clean. There are 1,443,764 documents in total and this is a mix of short and very long documents. These have been converted to files in Solr XML format, with 1,000 documents per file.

I'm running my Solr simply using

{noformat}
java -verbose:gc -Xmx512m -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so I'm not using any fancy GC options.

I'm posting using

{noformat}
curl -s http://localhost:8983/solr/update -H 'Content-type:text/xml; charset=UTF-8' --data-binary @solrxml/SolrXml-171.xml
{noformat}

and committing after all the files have been posted with

{noformat}
curl -s http://localhost:8983/solr/update -F 'stream.body= '
{noformat}

Posting the entire Wikipedia in one file would perhaps be a lot faster.

Posting took

{noformat}
real 18m39.206s
user 0m12.682s
sys 0m11.065s
{noformat}

The GC log looks fine with a maximum GC time of 0.0187319 seconds. There wasn't even a full GC, probably due to the large heap size. However, if Kuromoji were generating garbage, I'd expect to see it here, since the input in XML format is 1.7GB and the Viterbi would generate data many times that size during tokenization.

I'm attaching these files

|| Filename || Description ||
| jawiki-index-gc.log | GC log |
| jawiki-index-gcviewer.png | Screenshot from GCViewer |
| jawiki-index-visualvm.png | Screenshot from VisualVM |

Note that GCViewer had problems parsing the log file so the data in the screenshot might be off.

was (Author: cm):
h5.
Test 1: Indexing Japanese Wikipedia

In this test I'm only indexing documents -- no searching is being done.

I've extracted text pretty accurately from Japanese Wikipedia and removed all the gory markup so the content is clean. There are 1,443,764 documents in total and this is a mix of short and very long documents. These have been converted to files in Solr XML format, with 1,000 documents per file.

I'm running my Solr simply using

{noformat}
java -verbose:gc -Xmx512m -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so I'm not using any fancy GC options.

I'm posting using

{noformat}
curl -s http://localhost:8983/solr/update -H 'Content-type:text/xml; charset=UTF-8' --data-binary @solrxml/SolrXml-171.xml
{noformat}

and committing after all the files have been posted with

{noformat}
curl -s http://localhost:8983/solr/update -F 'stream.body= '
{noformat}

Posting the entire Wikipedia in one file would perhaps be a lot faster.

Posting took

{noformat}
real 18m39.206s
user 0m12.682s
sys 0m11.065s
{noformat}

The GC log looks fine with a maximum GC time of 0.0187319 seconds. There wasn't even a full GC, probably due to the large heap size.

I'm attaching these files

|| Filename || Description ||
| jawiki-index-gc.log | GC log |
| jawiki-index-gcviewer.png | Screenshot from GCViewer |
| jawiki-index-visualvm.png | Screenshot from VisualVM |

Note that GCViewer had problems parsing the log file so the data in the screenshot might be off.

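For readers who want to reproduce the summary figures in the comment above, here is a minimal Python sketch. It is not the tooling used in the test. It assumes the classic HotSpot {{-verbose:gc}} line shape shown in the code comment, and the names {{summarize_gc}} and {{parse_time_value}} are hypothetical helpers introduced only for this illustration. The document count, documents per file, and elapsed time are the figures quoted in the comment.

```python
import math
import re

# Assumed line shape for classic HotSpot -verbose:gc output (illustrative
# samples, not lines copied from jawiki-index-gc.log):
#   [GC 65536K->1234K(251392K), 0.0187319 secs]
#   [Full GC 1234K->567K(251392K), 0.1234567 secs]
GC_LINE = re.compile(r"\[(Full GC|GC)\b.*?,\s*([0-9]+\.[0-9]+) secs\]")


def summarize_gc(lines):
    """Return (max_pause_seconds, full_gc_count) for an iterable of log lines."""
    max_pause = 0.0
    full_gcs = 0
    for line in lines:
        match = GC_LINE.search(line)
        if match is None:
            continue  # skip anything that is not a GC event line
        kind, seconds = match.group(1), float(match.group(2))
        max_pause = max(max_pause, seconds)
        if kind == "Full GC":
            full_gcs += 1
    return max_pause, full_gcs


def parse_time_value(text):
    """Convert a shell `time` value such as '18m39.206s' to seconds."""
    match = re.fullmatch(r"(\d+)m([0-9.]+)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized time value: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# Figures quoted in the comment.
TOTAL_DOCS = 1_443_764
DOCS_PER_FILE = 1_000
elapsed = parse_time_value("18m39.206s")  # about 1119.206 seconds

files_needed = math.ceil(TOTAL_DOCS / DOCS_PER_FILE)
docs_per_second = TOTAL_DOCS / elapsed

print(f"{files_needed} files, {docs_per_second:.0f} docs/s")
# prints "1444 files, 1290 docs/s"
```

To apply this to a real log, pass an open file object to {{summarize_gc}}, for example {{summarize_gc(open("jawiki-index-gc.log"))}}. A full-GC count of zero and a maximum pause in the tens of milliseconds would match what the comment reports, provided the log really uses the assumed line shape.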
> Perform Kuromoji/Japanese stability test before 3.6 freeze
> ----------------------------------------------------------
>
> Key: SOLR-3282
> URL: https://issues.apache.org/jira/browse/SOLR-3282
> Project: Solr
> Issue Type: Task
> Components: Schema and Analysis
> Affects Versions: 3.6, 4.0
> Reporter: Christian Moen
> Assignee: Christian Moen
> Attachments: 250k-queries-no-highlight-gc.log, 250k-queries-no-highlight-visualvm.png, jawiki-index-gc.log, jawiki-index-gcviewer.png, jawiki-index-visualvm.png
>
>
> Kuromoji might be used by many and also in mission critical systems. I'd like to run a stability test before we freeze 3.6.
> My thinking is to test the out-of-the-box configuration using fieldtype {{text_ja}} as follows:
> # Index all of the Japanese Wikipedia documents (approx. 1.4M documents) in a never-ending loop
> # Simultaneously run many tens of thousands of typical Japanese queries against the index at 3-5 queries per second with highlighting turned on
> While Solr is indexing and searching, I'd like to verify that:
> * Indexing and queries are working as expected
> * Memory and heap usage looks stable over time
> * Garbage collection is overall low over time -- no Full-GC issues
> I'll post findings and results to this JIRA.
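
The query-load step of the plan quoted above (3-5 queries per second with highlighting turned on) could be paced along the following lines. This is a hedged sketch rather than the harness used for the issue. The {{/solr/select}} path and the {{q}} and {{hl}} parameters are assumptions based on Solr's stock example setup, and {{select_url}} and {{run_paced}} are hypothetical names. The sender is passed in as a function, so the pacing logic can be tried without a running Solr.

```python
import time
import urllib.parse


def select_url(query, base="http://localhost:8983/solr/select", highlight=True):
    """Build a query URL; the base path is an assumption, not from the issue."""
    params = {"q": query}
    if highlight:
        params["hl"] = "true"
    return base + "?" + urllib.parse.urlencode(params)


def run_paced(queries, send, qps=4.0, clock=time.monotonic, sleep=time.sleep):
    """Call send(url) for each query at a fixed target rate.

    Deadlines are advanced by a constant interval, so a slow request does
    not shift every later request; the loop simply skips the sleep until
    it has caught up with the schedule.
    """
    interval = 1.0 / qps
    next_at = clock()
    sent = 0
    for query in queries:
        delay = next_at - clock()
        if delay > 0:
            sleep(delay)
        send(select_url(query))
        sent += 1
        next_at += interval
    return sent


# Dry run: print the URLs instead of sending them.
run_paced(["東京", "検索"], send=print, qps=4.0)
```

To issue real requests, a caller could pass a function that opens the URL (for instance one built on {{urllib.request.urlopen}}) as {{send}}. Any value of {{qps}} between 3 and 5 matches the rate named in the plan.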