Message-ID: <766671743.1246164887167.JavaMail.jira@brutus>
Date: Sat, 27 Jun 2009 21:54:47 -0700 (PDT)
From: "Steven Rowe (JIRA)"
To: java-dev@lucene.apache.org
Reply-To: java-dev@lucene.apache.org
Subject: [jira] Updated: (LUCENE-1719) Add javadoc notes about ICUCollationKeyFilter's speed advantage over CollationKeyFilter
In-Reply-To: <1464189867.1246164527266.JavaMail.jira@brutus>

     [ https://issues.apache.org/jira/browse/LUCENE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated
LUCENE-1719:
--------------------------------

    Description:
contrib/collation's ICUCollationKeyFilter, which uses ICU4J collation, is faster than CollationKeyFilter, the JVM-provided java.text.Collator implementation in the same package. The javadocs of these classes should be modified to add a note to this effect.

My curiosity was piqued by [Robert Muir's comment|https://issues.apache.org/jira/browse/LUCENE-1581?focusedCommentId=12720300#action_12720300] on LUCENE-1581, in which he states that ICUCollationKeyFilter is up to 30x faster than CollationKeyFilter.

I timed the operation of these two classes, with Sun JVM versions 1.4.2/32-bit, 1.5.0/32- and 64-bit, and 1.6.0/64-bit, using 90k word lists of 4 languages (taken from the corresponding Debian wordlist packages and truncated to the first 90k words after a fixed random shuffling), using Collators at the default strength, on a Windows Vista 64-bit machine. I used an analysis pipeline consisting of WhitespaceTokenizer chained to the collation key filter, so to isolate the time taken by the collation key filters, I also timed WhitespaceTokenizer operating alone for each combination. The rightmost column represents the performance advantage of the ICU4J implementation (ICU) over the java.text.Collator implementation (JVM), after discounting the WhitespaceTokenizer time (WST): (JVM-WST) / (ICU-WST).
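As a quick check of the arithmetic, the following minimal Java sketch recomputes the improvement figure for two rows of the timing table as (JVM-WST) / (ICU-WST), which is the ratio the reported values correspond to. The class and method names (CollationSpeedup, improvement, format) are made up for illustration; they are not part of Lucene or of the attached patch.

```java
import java.util.Locale;

/** Recomputes the "ICU4J Improvement" column from the raw best-of-5 timings. */
final class CollationSpeedup {

    /**
     * Ratio of filter-only times: (JVM - WST) / (ICU - WST).
     * All arguments are milliseconds, as reported in the table.
     */
    static double improvement(long jvmMillis, long icuMillis, long wstMillis) {
        long jvmOnly = jvmMillis - wstMillis; // java.text.Collator filter alone
        long icuOnly = icuMillis - wstMillis; // ICU4J filter alone
        if (icuOnly <= 0) {
            throw new IllegalArgumentException("ICU time must exceed tokenizer time");
        }
        return (double) jvmOnly / icuOnly;
    }

    /** Formats a ratio the way the table does, e.g. "2.6x". */
    static String format(double ratio) {
        return String.format(Locale.ROOT, "%.1fx", ratio);
    }

    public static void main(String[] args) {
        // 1.4.2_17 (32 bit), English: (522 - 13) / (212 - 13)
        System.out.println(format(improvement(522, 212, 13))); // prints 2.6x
        // 1.6.0_13 (64 bit), Ukrainian: (273 - 21) / (202 - 21)
        System.out.println(format(improvement(273, 202, 21))); // prints 1.4x
    }
}
```

Subtracting the tokenizer time from both measurements before dividing keeps the shared WhitespaceTokenizer cost from diluting the ratio.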
The best times out of 5 runs for each combination, in milliseconds, are as follows:

||Sun JVM||Language||java.text||ICU4J||WhitespaceTokenizer||ICU4J Improvement||
|1.4.2_17 (32 bit)|English|522|212|13|2.6x|
|1.4.2_17 (32 bit)|French|716|243|14|3.1x|
|1.4.2_17 (32 bit)|German|669|264|16|2.6x|
|1.4.2_17 (32 bit)|Ukrainian|931|474|25|2.0x|
|1.5.0_15 (32 bit)|English|604|176|16|3.7x|
|1.5.0_15 (32 bit)|French|817|209|17|4.2x|
|1.5.0_15 (32 bit)|German|799|225|20|3.8x|
|1.5.0_15 (32 bit)|Ukrainian|1029|436|26|2.4x|
|1.5.0_15 (64 bit)|English|431|89|10|5.3x|
|1.5.0_15 (64 bit)|French|562|112|11|5.5x|
|1.5.0_15 (64 bit)|German|567|116|13|5.4x|
|1.5.0_15 (64 bit)|Ukrainian|734|281|21|2.7x|
|1.6.0_13 (64 bit)|English|162|81|9|2.1x|
|1.6.0_13 (64 bit)|French|192|92|10|2.2x|
|1.6.0_13 (64 bit)|German|204|99|14|2.2x|
|1.6.0_13 (64 bit)|Ukrainian|273|202|21|1.4x|


  was:
contrib/collation's ICUCollationKeyFilter, which uses ICU4J collation, is faster than CollationKeyFilter, the JVM-provided java.text.Collator implementation in the same package. The javadocs of these classes should be modified to add a note to this effect.

My curiosity was piqued by [Robert Muir's comment|https://issues.apache.org/jira/browse/LUCENE-1581?focusedCommentId=12720300#action_12720300] on LUCENE-1581, in which he states that ICUCollationKeyFilter is up to 30x faster than CollationKeyFilter.

I timed the operation of these two classes, with Sun JVM versions 1.4.2/32-bit, 1.5.0/32- and 64-bit, and 1.6.0/64-bit, using 90k word lists of 4 languages (taken from the corresponding Debian wordlist packages and truncated to the first 90k words after a fixed random shuffling), using Collators at the default strength, on a Windows Vista 64-bit machine.
I used an analysis pipeline consisting of WhitespaceTokenizer chained to the collation key filter, so to isolate the time taken by the collation key filters, I also timed WhitespaceTokenizer operating alone for each combination, and then subtracted it from both of the collation key analysis chains' times. The rightmost column represents the performance advantage of the ICU4J implementation (ICU) over the java.text.Collator implementation (JVM), after discounting the WhitespaceTokenizer time (WST): (JVM-WST) / (ICU-WST). The best times out of 5 runs for each combination, in milliseconds, are as follows:

||Sun JVM||Language||java.text||ICU4J||WhitespaceTokenizer||ICU4J Improvement||
|1.4.2_17 (32 bit)|English|522|212|13|2.6x|
|1.4.2_17 (32 bit)|French|716|243|14|3.1x|
|1.4.2_17 (32 bit)|German|669|264|16|2.6x|
|1.4.2_17 (32 bit)|Ukrainian|931|474|25|2.0x|
|1.5.0_15 (32 bit)|English|604|176|16|3.7x|
|1.5.0_15 (32 bit)|French|817|209|17|4.2x|
|1.5.0_15 (32 bit)|German|799|225|20|3.8x|
|1.5.0_15 (32 bit)|Ukrainian|1029|436|26|2.4x|
|1.5.0_15 (64 bit)|English|431|89|10|5.3x|
|1.5.0_15 (64 bit)|French|562|112|11|5.5x|
|1.5.0_15 (64 bit)|German|567|116|13|5.4x|
|1.5.0_15 (64 bit)|Ukrainian|734|281|21|2.7x|
|1.6.0_13 (64 bit)|English|162|81|9|2.1x|
|1.6.0_13 (64 bit)|French|192|92|10|2.2x|
|1.6.0_13 (64 bit)|German|204|99|14|2.2x|
|1.6.0_13 (64 bit)|Ukrainian|273|202|21|1.4x|


    Lucene Fields: [New, Patch Available]  (was: [New])

> Add javadoc notes about ICUCollationKeyFilter's speed advantage over CollationKeyFilter
> ---------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1719
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1719
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/*
>    Affects Versions: 2.4.1
>            Reporter: Steven Rowe
>            Priority: Trivial
>             Fix For: 2.9
>
>         Attachments: LUCENE-1719.patch
>
>
> contrib/collation's ICUCollationKeyFilter, which uses ICU4J collation, is faster than CollationKeyFilter,
the JVM-provided java.text.Collator implementation in the same package. The javadocs of these classes should be modified to add a note to this effect.
>
> My curiosity was piqued by [Robert Muir's comment|https://issues.apache.org/jira/browse/LUCENE-1581?focusedCommentId=12720300#action_12720300] on LUCENE-1581, in which he states that ICUCollationKeyFilter is up to 30x faster than CollationKeyFilter.
>
> I timed the operation of these two classes, with Sun JVM versions 1.4.2/32-bit, 1.5.0/32- and 64-bit, and 1.6.0/64-bit, using 90k word lists of 4 languages (taken from the corresponding Debian wordlist packages and truncated to the first 90k words after a fixed random shuffling), using Collators at the default strength, on a Windows Vista 64-bit machine. I used an analysis pipeline consisting of WhitespaceTokenizer chained to the collation key filter, so to isolate the time taken by the collation key filters, I also timed WhitespaceTokenizer operating alone for each combination. The rightmost column represents the performance advantage of the ICU4J implementation (ICU) over the java.text.Collator implementation (JVM), after discounting the WhitespaceTokenizer time (WST): (JVM-WST) / (ICU-WST).
The best times out of 5 runs for each combination, in milliseconds, are as follows:
>
> ||Sun JVM||Language||java.text||ICU4J||WhitespaceTokenizer||ICU4J Improvement||
> |1.4.2_17 (32 bit)|English|522|212|13|2.6x|
> |1.4.2_17 (32 bit)|French|716|243|14|3.1x|
> |1.4.2_17 (32 bit)|German|669|264|16|2.6x|
> |1.4.2_17 (32 bit)|Ukrainian|931|474|25|2.0x|
> |1.5.0_15 (32 bit)|English|604|176|16|3.7x|
> |1.5.0_15 (32 bit)|French|817|209|17|4.2x|
> |1.5.0_15 (32 bit)|German|799|225|20|3.8x|
> |1.5.0_15 (32 bit)|Ukrainian|1029|436|26|2.4x|
> |1.5.0_15 (64 bit)|English|431|89|10|5.3x|
> |1.5.0_15 (64 bit)|French|562|112|11|5.5x|
> |1.5.0_15 (64 bit)|German|567|116|13|5.4x|
> |1.5.0_15 (64 bit)|Ukrainian|734|281|21|2.7x|
> |1.6.0_13 (64 bit)|English|162|81|9|2.1x|
> |1.6.0_13 (64 bit)|French|192|92|10|2.2x|
> |1.6.0_13 (64 bit)|German|204|99|14|2.2x|
> |1.6.0_13 (64 bit)|Ukrainian|273|202|21|1.4x|

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org