Return-Path: X-Original-To: apmail-lucene-dev-archive@www.apache.org Delivered-To: apmail-lucene-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 2B315E40A for ; Tue, 20 Nov 2012 16:17:02 +0000 (UTC) Received: (qmail 29504 invoked by uid 500); 20 Nov 2012 16:17:00 -0000 Delivered-To: apmail-lucene-dev-archive@lucene.apache.org Received: (qmail 29427 invoked by uid 500); 20 Nov 2012 16:17:00 -0000 Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@lucene.apache.org Delivered-To: mailing list dev@lucene.apache.org Received: (qmail 29279 invoked by uid 99); 20 Nov 2012 16:17:00 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 20 Nov 2012 16:17:00 +0000 Date: Tue, 20 Nov 2012 16:17:00 +0000 (UTC) From: "Adrien Grand (JIRA)" To: dev@lucene.apache.org Message-ID: <470141108.7397.1353428220471.JavaMail.jiratomcat@arcas> Subject: [jira] [Resolved] (LUCENE-2221) Micro-benchmarks for ntz and pop (BitUtils) operations. MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/LUCENE-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-2221. ---------------------------------- Resolution: Fixed bq. Thanks Adrien. Thanks Dawid for having done the hard work! I'll give a look at Mike's benchs tonight to make sure this change didn't make anything worse... > Micro-benchmarks for ntz and pop (BitUtils) operations. > ------------------------------------------------------- > > Key: LUCENE-2221 > URL: https://issues.apache.org/jira/browse/LUCENE-2221 > Project: Lucene - Core > Issue Type: Task > Components: core/other > Reporter: Dawid Weiss > Assignee: Adrien Grand > Priority: Trivial > Attachments: benchmark.jar, benchmarks.txt, LUCENE-2221.patch, lucene-bitset-benchmarks.zip, results-popntz.txt > > > As suggested by Yonik, I performed a suite of micro-benchmarks to investigate the following: > * pop() (bitCount) seems to be implemented in the same way ("hacker's delight") as in the BitUtils class (SUN's standard library from version 1.5). > * native intrinsics have been recently added to the HotSpot that should speed up bitCount significantly. > I have tried to run the code on various VMs and architectures, but of course the results may vary depending on the setting. > h2. Micro-benchmark code, evaluation > * The micro-benchmarks were written as JUnit tests with a custom set of rules that repeats each test measuring execution time, gc use, etc. > * There were 5 warmup runs before each test, followed by 15 benchmarked runs. The result contain overall times, round times and standard deviations where applicable. > * There were several tests for isolated performance of {{BitUtil.pop()}}, JDK's {{bitCount}}, {{BitUtil.ntz()}}, {{BitUtil.ntz2()}}, {{BitUtil.ntz3()}} and JDK's {{numberOfTrailingZeros}}, the test code had the following loop: > {code} > final long [] longs = NtzPopBenchmark.random.bits; > int cnt = 0; > for (int i = longs.length; --i >= 0;) > { > cnt += Long.bitCount(longs[i]); > } > volatileVariable = cnt; // to prevent dead code removal. > {code} > * I also added another version of pop() based on a precomputed bit counts. This version was called {{pop2}}. > * The input array of long values was initialized to a memory taking 200MB. There were two different sets: {{random}} (random values) and {{single}} (single bit rotated to the right in each long). > * There were tests that called {{BitUtil.pop_xor}} between the two input bitsets (random, single). > * Additional tests that used iterators and and other BitUtil operations showed similar performance to isolated loops, so I omit them here. > h2. Evaluation environment > I tested on three different machines: > * Pentium 4, 32-bit, 3GHZ, 2GB RAM (Windows) > * AMD Athlon(tm) 64 X2 Dual Core Processor 5200+, 64-bit, 4GB RAM (Ubuntu) > * Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz, 64-bit, 4GB RAM (Ubuntu) > and on various VMs: > * 1.6.0_17, Java HotSpot(TM) Server VM, 14.3-b01, Sun Microsystems Inc., > * 1.5.0_18, Java HotSpot(TM) Server VM, 1.5.0_18-b02, Sun Microsystems Inc., > * 1.7.0-ea, Java HotSpot(TM) Server VM, 17.0-b06, Sun Microsystems Inc., > * 1.6.0, IBM J9 VM, 2.4, IBM Corporation, > * BEA JRockit. > * (ant other minor versions of the VMs above, depending on the computer). > h2. Results overview > h3. {{pop}} > The times between {{BitUtil}} and JDK were mostly identical. However, on 32-bit systems, precached {{pop2}} performed > *much* better. Examples: > {noformat} > # 1.6.0_17, Java HotSpot(TM) Server VM, 14.3-b01, Sun Microsystems Inc., > test_POP_JDK_random : 15/20 rounds, time.total: 15.61, time.warmup: 4.31, time.bench: 11.30, round: 0.75 [+- 0.02] > test_POP_JDK_single : 15/20 rounds, time.total: 15.67, time.warmup: 4.31, time.bench: 11.36, round: 0.76 [+- 0.02] > test_POP_BitUtil_random : 15/20 rounds, time.total: 15.55, time.warmup: 4.33, time.bench: 11.22, round: 0.75 [+- 0.01] > test_POP_BitUtil_single : 15/20 rounds, time.total: 15.55, time.warmup: 4.31, time.bench: 11.23, round: 0.75 [+- 0.01] > test_POP2_random : 15/20 rounds, time.total: 6.69, time.warmup: 1.75, time.bench: 4.94, round: 0.33 [+- 0.00] > test_POP2_single : 15/20 rounds, time.total: 4.66, time.warmup: 1.22, time.bench: 3.44, round: 0.23 [+- 0.01] > {noformat} > Note the difference between random and single distributions -- most probably due to more cache hits when referring to the > lookup table. Other VMs on this 32-bit machine: > {noformat} > # 1.5.0_18, Java HotSpot(TM) Server VM, 1.5.0_18-b02, Sun Microsystems Inc., > test_POP_JDK_random : 15/20 rounds, time.total: 20.67, time.warmup: 5.19, time.bench: 15.48, round: 1.03 [+- 0.01] > test_POP_JDK_single : 15/20 rounds, time.total: 22.70, time.warmup: 5.63, time.bench: 17.08, round: 1.14 [+- 0.01] > test_POP_BitUtil_random : 15/20 rounds, time.total: 22.69, time.warmup: 5.63, time.bench: 17.06, round: 1.14 [+- 0.01] > test_POP_BitUtil_single : 15/20 rounds, time.total: 20.67, time.warmup: 5.19, time.bench: 15.48, round: 1.03 [+- 0.01] > test_POP2_random : 15/20 rounds, time.total: 6.30, time.warmup: 1.63, time.bench: 4.67, round: 0.31 [+- 0.01] > test_POP2_single : 15/20 rounds, time.total: 4.33, time.warmup: 1.16, time.bench: 3.17, round: 0.21 [+- 0.01] > # 1.7.0-ea, Java HotSpot(TM) Server VM, 17.0-b06, Sun Microsystems Inc., > test_POP_JDK_random : 15/20 rounds, time.total: 15.28, time.warmup: 4.25, time.bench: 11.03, round: 0.74 [+- 0.03] > test_POP_JDK_single : 15/20 rounds, time.total: 15.16, time.warmup: 4.20, time.bench: 10.95, round: 0.73 [+- 0.01] > test_POP_BitUtil_random : 15/20 rounds, time.total: 15.12, time.warmup: 4.20, time.bench: 10.92, round: 0.73 [+- 0.01] > test_POP_BitUtil_single : 15/20 rounds, time.total: 15.13, time.warmup: 4.25, time.bench: 10.88, round: 0.73 [+- 0.01] > test_POP2_random : 15/20 rounds, time.total: 6.78, time.warmup: 1.72, time.bench: 5.06, round: 0.34 [+- 0.01] > test_POP2_single : 15/20 rounds, time.total: 4.72, time.warmup: 1.20, time.bench: 3.52, round: 0.23 [+- 0.02] > {noformat} > On 64-bit machines, the results are nearly equal, with pop2 performing slightly worse on SUN's 1.6 compared to JDK and BitUtil > (but this difference is really tiny and not present on all VMs; see IBM J9 and SUN's 1.5 below). > {noformat} > # 1.6.0_16, Java HotSpot(TM) 64-Bit Server VM, 14.2-b01, Sun Microsystems Inc., > test_POP_JDK_random : 15/20 rounds, time.total: 3.27, time.warmup: 0.81, time.bench: 2.46, round: 0.16 [+- 0.00] > test_POP_JDK_single : 15/20 rounds, time.total: 3.11, time.warmup: 0.76, time.bench: 2.34, round: 0.16 [+- 0.02] > test_POP_BitUtil_random : 15/20 rounds, time.total: 3.27, time.warmup: 0.81, time.bench: 2.46, round: 0.16 [+- 0.00] > test_POP_BitUtil_single : 15/20 rounds, time.total: 3.03, time.warmup: 0.77, time.bench: 2.26, round: 0.15 [+- 0.00] > test_POP2_random : 15/20 rounds, time.total: 3.63, time.warmup: 0.93, time.bench: 2.70, round: 0.18 [+- 0.00] > test_POP2_single : 15/20 rounds, time.total: 3.51, time.warmup: 0.89, time.bench: 2.62, round: 0.17 [+- 0.00] > # 1.6.0, IBM J9 VM, 2.4, IBM Corporation, > test_POP_JDK_random : 15/20 rounds, time.total: 4.80, time.warmup: 1.24, time.bench: 3.57, round: 0.24 [+- 0.01] > test_POP_JDK_single : 15/20 rounds, time.total: 5.00, time.warmup: 1.44, time.bench: 3.56, round: 0.24 [+- 0.01] > test_POP_BitUtil_random : 15/20 rounds, time.total: 4.81, time.warmup: 1.24, time.bench: 3.56, round: 0.24 [+- 0.01] > test_POP_BitUtil_single : 15/20 rounds, time.total: 4.75, time.warmup: 1.19, time.bench: 3.56, round: 0.24 [+- 0.01] > test_POP2_random : 15/20 rounds, time.total: 3.65, time.warmup: 0.90, time.bench: 2.76, round: 0.18 [+- 0.00] > test_POP2_single : 15/20 rounds, time.total: 3.82, time.warmup: 0.93, time.bench: 2.89, round: 0.19 [+- 0.01] > # 1.5.0_18, Java HotSpot(TM) 64-Bit Server VM, 1.5.0_18-b02, Sun Microsystems Inc., > test_POP_JDK_random : 15/20 rounds, time.total: 3.72, time.warmup: 0.94, time.bench: 2.78, round: 0.19 [+- 0.00] > test_POP_JDK_single : 15/20 rounds, time.total: 5.96, time.warmup: 1.40, time.bench: 4.56, round: 0.30 [+- 0.00] > test_POP_BitUtil_random : 15/20 rounds, time.total: 6.16, time.warmup: 1.43, time.bench: 4.73, round: 0.31 [+- 0.00] > test_POP_BitUtil_single : 15/20 rounds, time.total: 3.62, time.warmup: 0.92, time.bench: 2.70, round: 0.18 [+- 0.00] > test_POP2_random : 15/20 rounds, time.total: 3.70, time.warmup: 0.96, time.bench: 2.74, round: 0.18 [+- 0.00] > test_POP2_single : 15/20 rounds, time.total: 3.57, time.warmup: 0.93, time.bench: 2.65, round: 0.18 [+- 0.00] > {noformat} > The other 64-bit machine (quad-core): > {noformat} > # 1.7.0-ea, Java HotSpot(TM) 64-Bit Server VM, 17.0-b06, Sun Microsystems Inc., > test_POP_JDK_random : 15/20 rounds, time.total: 2.46, time.warmup: 0.62, time.bench: 1.84, round: 0.12 [+- 0.00] > test_POP_JDK_single : 15/20 rounds, time.total: 2.49, time.warmup: 0.62, time.bench: 1.87, round: 0.12 [+- 0.01] > test_POP_BitUtil_random : 15/20 rounds, time.total: 2.47, time.warmup: 0.62, time.bench: 1.84, round: 0.12 [+- 0.00] > test_POP_BitUtil_single : 15/20 rounds, time.total: 2.46, time.warmup: 0.62, time.bench: 1.84, round: 0.12 [+- 0.00] > test_POP2_random : 15/20 rounds, time.total: 2.82, time.warmup: 0.71, time.bench: 2.11, round: 0.14 [+- 0.00] > test_POP2_single : 15/20 rounds, time.total: 2.15, time.warmup: 0.55, time.bench: 1.61, round: 0.11 [+- 0.00] > {noformat} > I then replaced {{BitUtil.pop}} with {{BitUtil.pop2}} in bit-counting methods like xor/and/or. The results are intriguing. > On 32-bit systems, there is a measureable gain, like here: > {noformat} > # 1.6.0_17, Java HotSpot(TM) Server VM, 14.3-b01, Sun Microsystems Inc., > test_pop_xor : 15/20 rounds, time.total: 9.78, time.warmup: 2.59, time.bench: 7.19, round: 0.48 [+- 0.01] > test_pop2_hd_xor : 15/20 rounds, time.total: 8.27, time.warmup: 2.22, time.bench: 6.05, round: 0.40 [+- 0.01] > # 1.7.0-ea, Java HotSpot(TM) Server VM, 17.0-b06, Sun Microsystems Inc., > test_pop_xor : 15/20 rounds, time.total: 9.89, time.warmup: 2.59, time.bench: 7.30, round: 0.49 [+- 0.02] > test_pop2_hd_xor : 15/20 rounds, time.total: 8.20, time.warmup: 2.24, time.bench: 5.97, round: 0.40 [+- 0.01] > {noformat} > On 64-bit systems, when 64-bit values can be manipulated directly in registers, there was nearly no speedup or even > a small performance penalty like in here: > {noformat} > # 1.7.0-ea, Java HotSpot(TM) 64-Bit Server VM, 17.0-b06, Sun Microsystems Inc., > test_pop_xor : 15/20 rounds, time.total: 1.76, time.warmup: 0.49, time.bench: 1.27, round: 0.09 [+- 0.00] > test_pop2_hd_xor : 15/20 rounds, time.total: 2.06, time.warmup: 0.55, time.bench: 1.51, round: 0.10 [+- 0.00] > {noformat} > I'm guessing referencing memory on this fast processors is slower than manipulating registers. > h3. {{ntz}} > On JVMs prior to 1.7, the {{ntz} version from Lucene was much faster in my tests than the one from JDK, > but it also has a greater variance depending on the input bits' distribution (compare the same routine > for random and single below). > {noformat} > # 32-bit system; > # 1.6.0_17, Java HotSpot(TM) Server VM, 14.3-b01, Sun Microsystems Inc., > test_NTZ_JDK_random : 15/20 rounds, time.total: 6.69, time.warmup: 1.73, time.bench: 4.95, round: 0.33 [+- 0.01] > test_NTZ_JDK_single : 15/20 rounds, time.total: 7.59, time.warmup: 1.94, time.bench: 5.66, round: 0.38 [+- 0.01] > test_NTZ_BitUtil_random : 15/20 rounds, time.total: 2.72, time.warmup: 0.73, time.bench: 1.98, round: 0.13 [+- 0.02] > test_NTZ_BitUtil_single : 15/20 rounds, time.total: 5.28, time.warmup: 1.34, time.bench: 3.94, round: 0.26 [+- 0.02] > test_NTZ2_BitUtil_random : 15/20 rounds, time.total: 3.06, time.warmup: 0.81, time.bench: 2.25, round: 0.15 [+- 0.01] > test_NTZ2_BitUtil_single : 15/20 rounds, time.total: 5.36, time.warmup: 1.34, time.bench: 4.02, round: 0.27 [+- 0.01] > test_NTZ3_BitUtil_random : 15/20 rounds, time.total: 5.80, time.warmup: 1.48, time.bench: 4.31, round: 0.29 [+- 0.01] > test_NTZ3_BitUtil_single : 15/20 rounds, time.total: 6.98, time.warmup: 1.81, time.bench: 5.17, round: 0.34 [+- 0.01] > # 64-bit Athlon > # 1.6.0_16, Java HotSpot(TM) 64-Bit Server VM, 14.2-b01, Sun Microsystems Inc., > test_NTZ_JDK_random : 15/20 rounds, time.total: 4.59, time.warmup: 1.16, time.bench: 3.44, round: 0.23 [+- 0.00] > test_NTZ_JDK_single : 15/20 rounds, time.total: 6.64, time.warmup: 1.59, time.bench: 5.04, round: 0.34 [+- 0.01] > test_NTZ_BitUtil_random : 15/20 rounds, time.total: 2.09, time.warmup: 0.53, time.bench: 1.56, round: 0.10 [+- 0.00] > test_NTZ_BitUtil_single : 15/20 rounds, time.total: 3.87, time.warmup: 0.98, time.bench: 2.90, round: 0.19 [+- 0.00] > test_NTZ2_BitUtil_random : 15/20 rounds, time.total: 2.09, time.warmup: 0.52, time.bench: 1.57, round: 0.10 [+- 0.00] > test_NTZ2_BitUtil_single : 15/20 rounds, time.total: 3.31, time.warmup: 0.84, time.bench: 2.47, round: 0.16 [+- 0.00] > test_NTZ3_BitUtil_random : 15/20 rounds, time.total: 3.31, time.warmup: 0.83, time.bench: 2.48, round: 0.17 [+- 0.00] > test_NTZ3_BitUtil_single : 15/20 rounds, time.total: 5.71, time.warmup: 1.39, time.bench: 4.32, round: 0.29 [+- 0.00] > {noformat} > But then comes the 1.7 HotSport and things change radically, on 32-bit system the JDK's version is much faster for nearly-empty > {{long}} values: > {noformat} > # 1.7.0-ea, Java HotSpot(TM) Server VM, 17.0-b06, Sun Microsystems Inc., > test_NTZ_JDK_random : 15/20 rounds, time.total: 1.97, time.warmup: 0.61, time.bench: 1.36, round: 0.09 [+- 0.01] > test_NTZ_JDK_single : 15/20 rounds, time.total: 2.53, time.warmup: 0.77, time.bench: 1.77, round: 0.12 [+- 0.01] > test_NTZ_BitUtil_random : 15/20 rounds, time.total: 2.36, time.warmup: 0.66, time.bench: 1.70, round: 0.11 [+- 0.01] > test_NTZ_BitUtil_single : 15/20 rounds, time.total: 4.50, time.warmup: 1.19, time.bench: 3.31, round: 0.22 [+- 0.01] > test_NTZ2_BitUtil_random : 15/20 rounds, time.total: 3.08, time.warmup: 0.81, time.bench: 2.27, round: 0.15 [+- 0.01] > test_NTZ2_BitUtil_single : 15/20 rounds, time.total: 4.97, time.warmup: 1.28, time.bench: 3.69, round: 0.25 [+- 0.01] > test_NTZ3_BitUtil_random : 15/20 rounds, time.total: 5.78, time.warmup: 1.48, time.bench: 4.30, round: 0.29 [+- 0.01] > test_NTZ3_BitUtil_single : 15/20 rounds, time.total: 7.77, time.warmup: 1.91, time.bench: 5.86, round: 0.39 [+- 0.01] > {noformat} > On the 64-bit quad core: > {noformat} > # 1.6.0_13, Java HotSpot(TM) 64-Bit Server VM, 11.3-b02, Sun Microsystems Inc., > test_NTZ_JDK_random : 15/20 rounds, time.total: 3.92, time.warmup: 0.97, time.bench: 2.94, round: 0.20 [+- 0.00] > test_NTZ_JDK_single : 15/20 rounds, time.total: 3.80, time.warmup: 0.97, time.bench: 2.82, round: 0.19 [+- 0.00] > test_NTZ_BitUtil_random : 15/20 rounds, time.total: 0.96, time.warmup: 0.25, time.bench: 0.71, round: 0.05 [+- 0.00] > test_NTZ_BitUtil_single : 15/20 rounds, time.total: 2.74, time.warmup: 0.69, time.bench: 2.04, round: 0.14 [+- 0.00] > test_NTZ2_BitUtil_random : 15/20 rounds, time.total: 1.22, time.warmup: 0.31, time.bench: 0.91, round: 0.06 [+- 0.00] > test_NTZ2_BitUtil_single : 15/20 rounds, time.total: 2.18, time.warmup: 0.56, time.bench: 1.62, round: 0.11 [+- 0.00] > test_NTZ3_BitUtil_random : 15/20 rounds, time.total: 2.76, time.warmup: 0.71, time.bench: 2.06, round: 0.14 [+- 0.00] > test_NTZ3_BitUtil_single : 15/20 rounds, time.total: 3.47, time.warmup: 0.91, time.bench: 2.56, round: 0.17 [+- 0.01] > {noformat} > And then comes the 1.7, compare JDK's implementation with anything else (especially the {{time.bench}} for > the {{single}} input data. Looks like this is hardware-accelerated. > {noformat} > # -server -Xbatch -Xmx1024m > # 1.7.0-ea, Java HotSpot(TM) 64-Bit Server VM, 17.0-b06, Sun Microsystems Inc., > test_NTZ_JDK_random : 15/20 rounds, time.total: 0.79, time.warmup: 0.21, time.bench: 0.58, round: 0.04 [+- 0.00] > test_NTZ_JDK_single : 15/20 rounds, time.total: 0.75, time.warmup: 0.20, time.bench: 0.55, round: 0.04 [+- 0.00] > test_NTZ_BitUtil_random : 15/20 rounds, time.total: 0.98, time.warmup: 0.25, time.bench: 0.72, round: 0.05 [+- 0.00] > test_NTZ_BitUtil_single : 15/20 rounds, time.total: 2.61, time.warmup: 0.66, time.bench: 1.95, round: 0.13 [+- 0.00] > test_NTZ2_BitUtil_random : 15/20 rounds, time.total: 1.30, time.warmup: 0.33, time.bench: 0.97, round: 0.06 [+- 0.00] > test_NTZ2_BitUtil_single : 15/20 rounds, time.total: 2.48, time.warmup: 0.61, time.bench: 1.88, round: 0.13 [+- 0.00] > test_NTZ3_BitUtil_random : 15/20 rounds, time.total: 2.81, time.warmup: 0.70, time.bench: 2.11, round: 0.14 [+- 0.00] > test_NTZ3_BitUtil_single : 15/20 rounds, time.total: 4.07, time.warmup: 1.02, time.bench: 3.05, round: 0.20 [+- 0.00] > {noformat} > h2. Conclusions > It seems that any change introduced to these routines will hurt somebody in some configuration, so it's really hard > for me to make choices. I would definitely opt for the precached {{pop2}} version on 32-bit systems as it seems to > be always faster or equally fast compared to other bit counting options. {{pop2}} looked like this: > {code} > private static byte [] bcounts = new byte [0x10000]; > static > { > for (int i = 0x10000; --i >= 0;) > bcounts[i] = (byte) Integer.bitCount(i); > } > public static int pop2(long v) > { > int t; > return > bcounts[(t = (int) v) & 0xffff] > + bcounts[t >>> 16] > + bcounts[(t = ((int) (v >>> 32))) & 0xffff] > + bcounts[t >>> 16]; > } > {code} > As for the hardware-accelerated {{ntz}}, if one can detect 1.7, then using the JDK's version is speeding up things > significantly. But I have not checked how this detection would affect speed if done at run-time (I assume a final > static flag wouldn't cause any performance penalty) and it is definitely not worth replacing for folks with older > VMs. > h2. Raw results data. > I will attach raw results as part of the issue if you want to draw your own conclusions. Didn't have access to sparc-machine > or to any machine with the newest Intels. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org For additional commands, e-mail: dev-help@lucene.apache.org