Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@lucene.apache.org
Date: Mon, 19 Nov 2012 11:59:01 +0000 (UTC)
From: "Adrien Grand (JIRA)"
To: dev@lucene.apache.org
Message-ID: <1983825989.2533.1353326341662.JavaMail.jiratomcat@arcas>
Subject: [jira] [Commented] (LUCENE-2221) Micro-benchmarks for ntz and pop (BitUtils) operations.
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

    [ https://issues.apache.org/jira/browse/LUCENE-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13500182#comment-13500182 ]

Adrien Grand commented on LUCENE-2221:
--------------------------------------

Now that we have dropped support for Java 5, maybe it would make sense to make Lucene use the JDK impl of ntz? According to the release notes, the numberOfTrailingZeros method was made an intrinsic in Java 6u18 [1], which is nearly 3 years old now, so this sounds like a safe bet?

[1] http://www.oracle.com/technetwork/java/javase/6u18-142093.html

> Micro-benchmarks for ntz and pop (BitUtils) operations.
> -------------------------------------------------------
>
>                 Key: LUCENE-2221
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2221
>             Project: Lucene - Core
>          Issue Type: Task
>          Components: core/other
>            Reporter: Dawid Weiss
>            Priority: Trivial
>         Attachments: benchmark.jar, benchmarks.txt, lucene-bitset-benchmarks.zip, results-popntz.txt
>
>
> As suggested by Yonik, I performed a suite of micro-benchmarks to investigate the following:
> * pop() (bitCount) seems to be implemented in the same way ("Hacker's Delight") in the BitUtil class and in Sun's standard library (which has had bitCount since 1.5).
> * native intrinsics have recently been added to HotSpot that should speed up bitCount significantly.
> I have tried to run the code on various VMs and architectures, but of course the results may vary depending on the setting.
> h2. Micro-benchmark code, evaluation
> * The micro-benchmarks were written as JUnit tests with a custom set of rules that repeats each test, measuring execution time, GC use, etc.
> * There were 5 warmup runs before each test, followed by 15 benchmarked runs. The results contain overall times, round times and standard deviations where applicable.
> * There were several tests for isolated performance of {{BitUtil.pop()}}, JDK's {{bitCount}}, {{BitUtil.ntz()}}, {{BitUtil.ntz2()}}, {{BitUtil.ntz3()}} and JDK's {{numberOfTrailingZeros}}; the test code had the following loop:
> {code}
> final long [] longs = NtzPopBenchmark.random.bits;
> int cnt = 0;
> for (int i = longs.length; --i >= 0;)
> {
>     cnt += Long.bitCount(longs[i]);
> }
> volatileVariable = cnt; // to prevent dead code removal.
> {code}
> * I also added another version of pop() based on precomputed bit counts. This version was called {{pop2}}.
> * The input array of long values was initialized to take 200MB of memory. There were two different sets: {{random}} (random values) and {{single}} (a single bit rotated to the right in each long).
> * There were tests that called {{BitUtil.pop_xor}} between the two input bitsets (random, single).
> * Additional tests that used iterators and other BitUtil operations showed similar performance to isolated loops, so I omit them here.
> h2. Evaluation environment
> I tested on three different machines:
> * Pentium 4, 32-bit, 3GHz, 2GB RAM (Windows)
> * AMD Athlon(tm) 64 X2 Dual Core Processor 5200+, 64-bit, 4GB RAM (Ubuntu)
> * Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz, 64-bit, 4GB RAM (Ubuntu)
> and on various VMs:
> * 1.6.0_17, Java HotSpot(TM) Server VM, 14.3-b01, Sun Microsystems Inc.
> * 1.5.0_18, Java HotSpot(TM) Server VM, 1.5.0_18-b02, Sun Microsystems Inc.
> * 1.7.0-ea, Java HotSpot(TM) Server VM, 17.0-b06, Sun Microsystems Inc.
> * 1.6.0, IBM J9 VM, 2.4, IBM Corporation
> * BEA JRockit
> * (and other minor versions of the VMs above, depending on the computer).
> h2. Results overview
> h3. {{pop}}
> The times between {{BitUtil}} and JDK were mostly identical. However, on 32-bit systems, the precached {{pop2}} performed
> *much* better.
> Examples:
> {noformat}
> # 1.6.0_17, Java HotSpot(TM) Server VM, 14.3-b01, Sun Microsystems Inc.
> test_POP_JDK_random : 15/20 rounds, time.total: 15.61, time.warmup: 4.31, time.bench: 11.30, round: 0.75 [+- 0.02]
> test_POP_JDK_single : 15/20 rounds, time.total: 15.67, time.warmup: 4.31, time.bench: 11.36, round: 0.76 [+- 0.02]
> test_POP_BitUtil_random : 15/20 rounds, time.total: 15.55, time.warmup: 4.33, time.bench: 11.22, round: 0.75 [+- 0.01]
> test_POP_BitUtil_single : 15/20 rounds, time.total: 15.55, time.warmup: 4.31, time.bench: 11.23, round: 0.75 [+- 0.01]
> test_POP2_random : 15/20 rounds, time.total: 6.69, time.warmup: 1.75, time.bench: 4.94, round: 0.33 [+- 0.00]
> test_POP2_single : 15/20 rounds, time.total: 4.66, time.warmup: 1.22, time.bench: 3.44, round: 0.23 [+- 0.01]
> {noformat}
> Note the difference between random and single distributions -- most probably due to more cache hits when referring to the
> lookup table. Other VMs on this 32-bit machine:
> {noformat}
> # 1.5.0_18, Java HotSpot(TM) Server VM, 1.5.0_18-b02, Sun Microsystems Inc.
> test_POP_JDK_random : 15/20 rounds, time.total: 20.67, time.warmup: 5.19, time.bench: 15.48, round: 1.03 [+- 0.01]
> test_POP_JDK_single : 15/20 rounds, time.total: 22.70, time.warmup: 5.63, time.bench: 17.08, round: 1.14 [+- 0.01]
> test_POP_BitUtil_random : 15/20 rounds, time.total: 22.69, time.warmup: 5.63, time.bench: 17.06, round: 1.14 [+- 0.01]
> test_POP_BitUtil_single : 15/20 rounds, time.total: 20.67, time.warmup: 5.19, time.bench: 15.48, round: 1.03 [+- 0.01]
> test_POP2_random : 15/20 rounds, time.total: 6.30, time.warmup: 1.63, time.bench: 4.67, round: 0.31 [+- 0.01]
> test_POP2_single : 15/20 rounds, time.total: 4.33, time.warmup: 1.16, time.bench: 3.17, round: 0.21 [+- 0.01]
> # 1.7.0-ea, Java HotSpot(TM) Server VM, 17.0-b06, Sun Microsystems Inc.
> test_POP_JDK_random : 15/20 rounds, time.total: 15.28, time.warmup: 4.25, time.bench: 11.03, round: 0.74 [+- 0.03]
> test_POP_JDK_single : 15/20 rounds, time.total: 15.16, time.warmup: 4.20, time.bench: 10.95, round: 0.73 [+- 0.01]
> test_POP_BitUtil_random : 15/20 rounds, time.total: 15.12, time.warmup: 4.20, time.bench: 10.92, round: 0.73 [+- 0.01]
> test_POP_BitUtil_single : 15/20 rounds, time.total: 15.13, time.warmup: 4.25, time.bench: 10.88, round: 0.73 [+- 0.01]
> test_POP2_random : 15/20 rounds, time.total: 6.78, time.warmup: 1.72, time.bench: 5.06, round: 0.34 [+- 0.01]
> test_POP2_single : 15/20 rounds, time.total: 4.72, time.warmup: 1.20, time.bench: 3.52, round: 0.23 [+- 0.02]
> {noformat}
> On 64-bit machines, the results are nearly equal, with pop2 performing slightly worse on Sun's 1.6 compared to JDK and BitUtil
> (but this difference is really tiny and not present on all VMs; see IBM J9 and Sun's 1.5 below).
> {noformat}
> # 1.6.0_16, Java HotSpot(TM) 64-Bit Server VM, 14.2-b01, Sun Microsystems Inc.
> test_POP_JDK_random : 15/20 rounds, time.total: 3.27, time.warmup: 0.81, time.bench: 2.46, round: 0.16 [+- 0.00]
> test_POP_JDK_single : 15/20 rounds, time.total: 3.11, time.warmup: 0.76, time.bench: 2.34, round: 0.16 [+- 0.02]
> test_POP_BitUtil_random : 15/20 rounds, time.total: 3.27, time.warmup: 0.81, time.bench: 2.46, round: 0.16 [+- 0.00]
> test_POP_BitUtil_single : 15/20 rounds, time.total: 3.03, time.warmup: 0.77, time.bench: 2.26, round: 0.15 [+- 0.00]
> test_POP2_random : 15/20 rounds, time.total: 3.63, time.warmup: 0.93, time.bench: 2.70, round: 0.18 [+- 0.00]
> test_POP2_single : 15/20 rounds, time.total: 3.51, time.warmup: 0.89, time.bench: 2.62, round: 0.17 [+- 0.00]
> # 1.6.0, IBM J9 VM, 2.4, IBM Corporation
> test_POP_JDK_random : 15/20 rounds, time.total: 4.80, time.warmup: 1.24, time.bench: 3.57, round: 0.24 [+- 0.01]
> test_POP_JDK_single : 15/20 rounds, time.total: 5.00, time.warmup: 1.44, time.bench: 3.56, round: 0.24 [+- 0.01]
> test_POP_BitUtil_random : 15/20 rounds, time.total: 4.81, time.warmup: 1.24, time.bench: 3.56, round: 0.24 [+- 0.01]
> test_POP_BitUtil_single : 15/20 rounds, time.total: 4.75, time.warmup: 1.19, time.bench: 3.56, round: 0.24 [+- 0.01]
> test_POP2_random : 15/20 rounds, time.total: 3.65, time.warmup: 0.90, time.bench: 2.76, round: 0.18 [+- 0.00]
> test_POP2_single : 15/20 rounds, time.total: 3.82, time.warmup: 0.93, time.bench: 2.89, round: 0.19 [+- 0.01]
> # 1.5.0_18, Java HotSpot(TM) 64-Bit Server VM, 1.5.0_18-b02, Sun Microsystems Inc.
> test_POP_JDK_random : 15/20 rounds, time.total: 3.72, time.warmup: 0.94, time.bench: 2.78, round: 0.19 [+- 0.00]
> test_POP_JDK_single : 15/20 rounds, time.total: 5.96, time.warmup: 1.40, time.bench: 4.56, round: 0.30 [+- 0.00]
> test_POP_BitUtil_random : 15/20 rounds, time.total: 6.16, time.warmup: 1.43, time.bench: 4.73, round: 0.31 [+- 0.00]
> test_POP_BitUtil_single : 15/20 rounds, time.total: 3.62, time.warmup: 0.92, time.bench: 2.70, round: 0.18 [+- 0.00]
> test_POP2_random : 15/20 rounds, time.total: 3.70, time.warmup: 0.96, time.bench: 2.74, round: 0.18 [+- 0.00]
> test_POP2_single : 15/20 rounds, time.total: 3.57, time.warmup: 0.93, time.bench: 2.65, round: 0.18 [+- 0.00]
> {noformat}
> The other 64-bit machine (quad-core):
> {noformat}
> # 1.7.0-ea, Java HotSpot(TM) 64-Bit Server VM, 17.0-b06, Sun Microsystems Inc.
> test_POP_JDK_random : 15/20 rounds, time.total: 2.46, time.warmup: 0.62, time.bench: 1.84, round: 0.12 [+- 0.00]
> test_POP_JDK_single : 15/20 rounds, time.total: 2.49, time.warmup: 0.62, time.bench: 1.87, round: 0.12 [+- 0.01]
> test_POP_BitUtil_random : 15/20 rounds, time.total: 2.47, time.warmup: 0.62, time.bench: 1.84, round: 0.12 [+- 0.00]
> test_POP_BitUtil_single : 15/20 rounds, time.total: 2.46, time.warmup: 0.62, time.bench: 1.84, round: 0.12 [+- 0.00]
> test_POP2_random : 15/20 rounds, time.total: 2.82, time.warmup: 0.71, time.bench: 2.11, round: 0.14 [+- 0.00]
> test_POP2_single : 15/20 rounds, time.total: 2.15, time.warmup: 0.55, time.bench: 1.61, round: 0.11 [+- 0.00]
> {noformat}
> I then replaced {{BitUtil.pop}} with
> {{BitUtil.pop2}} in bit-counting methods like xor/and/or. The results are intriguing.
> On 32-bit systems, there is a measurable gain, like here:
> {noformat}
> # 1.6.0_17, Java HotSpot(TM) Server VM, 14.3-b01, Sun Microsystems Inc.
> test_pop_xor : 15/20 rounds, time.total: 9.78, time.warmup: 2.59, time.bench: 7.19, round: 0.48 [+- 0.01]
> test_pop2_hd_xor : 15/20 rounds, time.total: 8.27, time.warmup: 2.22, time.bench: 6.05, round: 0.40 [+- 0.01]
> # 1.7.0-ea, Java HotSpot(TM) Server VM, 17.0-b06, Sun Microsystems Inc.
> test_pop_xor : 15/20 rounds, time.total: 9.89, time.warmup: 2.59, time.bench: 7.30, round: 0.49 [+- 0.02]
> test_pop2_hd_xor : 15/20 rounds, time.total: 8.20, time.warmup: 2.24, time.bench: 5.97, round: 0.40 [+- 0.01]
> {noformat}
> On 64-bit systems, where 64-bit values can be manipulated directly in registers, there was nearly no speedup or even
> a small performance penalty, like here:
> {noformat}
> # 1.7.0-ea, Java HotSpot(TM) 64-Bit Server VM, 17.0-b06, Sun Microsystems Inc.
> test_pop_xor : 15/20 rounds, time.total: 1.76, time.warmup: 0.49, time.bench: 1.27, round: 0.09 [+- 0.00]
> test_pop2_hd_xor : 15/20 rounds, time.total: 2.06, time.warmup: 0.55, time.bench: 1.51, round: 0.10 [+- 0.00]
> {noformat}
> I'm guessing referencing memory on these fast processors is slower than manipulating registers.
> h3. {{ntz}}
> On JVMs prior to 1.7, the {{ntz}} version from Lucene was much faster in my tests than the one from JDK,
> but it also has a greater variance depending on the input bits' distribution (compare the same routine
> for random and single below).
> {noformat}
> # 32-bit system;
> # 1.6.0_17, Java HotSpot(TM) Server VM, 14.3-b01, Sun Microsystems Inc.
> test_NTZ_JDK_random : 15/20 rounds, time.total: 6.69, time.warmup: 1.73, time.bench: 4.95, round: 0.33 [+- 0.01]
> test_NTZ_JDK_single : 15/20 rounds, time.total: 7.59, time.warmup: 1.94, time.bench: 5.66, round: 0.38 [+- 0.01]
> test_NTZ_BitUtil_random : 15/20 rounds, time.total: 2.72, time.warmup: 0.73, time.bench: 1.98, round: 0.13 [+- 0.02]
> test_NTZ_BitUtil_single : 15/20 rounds, time.total: 5.28, time.warmup: 1.34, time.bench: 3.94, round: 0.26 [+- 0.02]
> test_NTZ2_BitUtil_random : 15/20 rounds, time.total: 3.06, time.warmup: 0.81, time.bench: 2.25, round: 0.15 [+- 0.01]
> test_NTZ2_BitUtil_single : 15/20 rounds, time.total: 5.36, time.warmup: 1.34, time.bench: 4.02, round: 0.27 [+- 0.01]
> test_NTZ3_BitUtil_random : 15/20 rounds, time.total: 5.80, time.warmup: 1.48, time.bench: 4.31, round: 0.29 [+- 0.01]
> test_NTZ3_BitUtil_single : 15/20 rounds, time.total: 6.98, time.warmup: 1.81, time.bench: 5.17, round: 0.34 [+- 0.01]
> # 64-bit Athlon
> # 1.6.0_16, Java HotSpot(TM) 64-Bit Server VM, 14.2-b01, Sun Microsystems Inc.
> test_NTZ_JDK_random : 15/20 rounds, time.total: 4.59, time.warmup: 1.16, time.bench: 3.44, round: 0.23 [+- 0.00]
> test_NTZ_JDK_single : 15/20 rounds, time.total: 6.64, time.warmup: 1.59, time.bench: 5.04, round: 0.34 [+- 0.01]
> test_NTZ_BitUtil_random : 15/20 rounds, time.total: 2.09, time.warmup: 0.53, time.bench: 1.56, round: 0.10 [+- 0.00]
> test_NTZ_BitUtil_single : 15/20 rounds, time.total: 3.87, time.warmup: 0.98, time.bench: 2.90, round: 0.19 [+- 0.00]
> test_NTZ2_BitUtil_random : 15/20 rounds, time.total: 2.09, time.warmup: 0.52, time.bench: 1.57, round: 0.10 [+- 0.00]
> test_NTZ2_BitUtil_single : 15/20 rounds, time.total: 3.31, time.warmup: 0.84, time.bench: 2.47, round: 0.16 [+- 0.00]
> test_NTZ3_BitUtil_random : 15/20 rounds, time.total: 3.31, time.warmup: 0.83, time.bench: 2.48, round: 0.17 [+- 0.00]
> test_NTZ3_BitUtil_single : 15/20 rounds, time.total: 5.71, time.warmup: 1.39, time.bench: 4.32, round: 0.29 [+- 0.00]
> {noformat}
> But then comes the 1.7 HotSpot and things change radically; on 32-bit systems the JDK's version is much faster for nearly-empty
> {{long}} values:
> {noformat}
> # 1.7.0-ea, Java HotSpot(TM) Server VM, 17.0-b06, Sun Microsystems Inc.
> test_NTZ_JDK_random : 15/20 rounds, time.total: 1.97, time.warmup: 0.61, time.bench: 1.36, round: 0.09 [+- 0.01]
> test_NTZ_JDK_single : 15/20 rounds, time.total: 2.53, time.warmup: 0.77, time.bench: 1.77, round: 0.12 [+- 0.01]
> test_NTZ_BitUtil_random : 15/20 rounds, time.total: 2.36, time.warmup: 0.66, time.bench: 1.70, round: 0.11 [+- 0.01]
> test_NTZ_BitUtil_single : 15/20 rounds, time.total: 4.50, time.warmup: 1.19, time.bench: 3.31, round: 0.22 [+- 0.01]
> test_NTZ2_BitUtil_random : 15/20 rounds, time.total: 3.08, time.warmup: 0.81, time.bench: 2.27, round: 0.15 [+- 0.01]
> test_NTZ2_BitUtil_single : 15/20 rounds, time.total: 4.97, time.warmup: 1.28, time.bench: 3.69, round: 0.25 [+- 0.01]
> test_NTZ3_BitUtil_random : 15/20 rounds, time.total: 5.78, time.warmup: 1.48, time.bench: 4.30, round: 0.29 [+- 0.01]
> test_NTZ3_BitUtil_single : 15/20 rounds, time.total: 7.77, time.warmup: 1.91, time.bench: 5.86, round: 0.39 [+- 0.01]
> {noformat}
> On the 64-bit quad core:
> {noformat}
> # 1.6.0_13, Java HotSpot(TM) 64-Bit Server VM, 11.3-b02, Sun Microsystems Inc.
> test_NTZ_JDK_random : 15/20 rounds, time.total: 3.92, time.warmup: 0.97, time.bench: 2.94, round: 0.20 [+- 0.00]
> test_NTZ_JDK_single : 15/20 rounds, time.total: 3.80, time.warmup: 0.97, time.bench: 2.82, round: 0.19 [+- 0.00]
> test_NTZ_BitUtil_random : 15/20 rounds, time.total: 0.96, time.warmup: 0.25, time.bench: 0.71, round: 0.05 [+- 0.00]
> test_NTZ_BitUtil_single : 15/20 rounds, time.total: 2.74, time.warmup: 0.69, time.bench: 2.04, round: 0.14 [+- 0.00]
> test_NTZ2_BitUtil_random : 15/20 rounds, time.total: 1.22, time.warmup: 0.31, time.bench: 0.91, round: 0.06 [+- 0.00]
> test_NTZ2_BitUtil_single : 15/20 rounds, time.total: 2.18, time.warmup: 0.56, time.bench: 1.62, round: 0.11 [+- 0.00]
> test_NTZ3_BitUtil_random : 15/20 rounds, time.total: 2.76, time.warmup: 0.71, time.bench: 2.06, round: 0.14 [+- 0.00]
> test_NTZ3_BitUtil_single : 15/20 rounds, time.total: 3.47, time.warmup: 0.91, time.bench: 2.56, round: 0.17 [+- 0.01]
> {noformat}
> And then comes 1.7: compare the JDK's implementation with anything else (especially the {{time.bench}} for
> the {{single}} input data). Looks like this is hardware-accelerated.
> {noformat}
> # -server -Xbatch -Xmx1024m
> # 1.7.0-ea, Java HotSpot(TM) 64-Bit Server VM, 17.0-b06, Sun Microsystems Inc.
> test_NTZ_JDK_random : 15/20 rounds, time.total: 0.79, time.warmup: 0.21, time.bench: 0.58, round: 0.04 [+- 0.00]
> test_NTZ_JDK_single : 15/20 rounds, time.total: 0.75, time.warmup: 0.20, time.bench: 0.55, round: 0.04 [+- 0.00]
> test_NTZ_BitUtil_random : 15/20 rounds, time.total: 0.98, time.warmup: 0.25, time.bench: 0.72, round: 0.05 [+- 0.00]
> test_NTZ_BitUtil_single : 15/20 rounds, time.total: 2.61, time.warmup: 0.66, time.bench: 1.95, round: 0.13 [+- 0.00]
> test_NTZ2_BitUtil_random : 15/20 rounds, time.total: 1.30, time.warmup: 0.33, time.bench: 0.97, round: 0.06 [+- 0.00]
> test_NTZ2_BitUtil_single : 15/20 rounds, time.total: 2.48, time.warmup: 0.61, time.bench: 1.88, round: 0.13 [+- 0.00]
> test_NTZ3_BitUtil_random : 15/20 rounds, time.total: 2.81, time.warmup: 0.70, time.bench: 2.11, round: 0.14 [+- 0.00]
> test_NTZ3_BitUtil_single : 15/20 rounds, time.total: 4.07, time.warmup: 1.02, time.bench: 3.05, round: 0.20 [+- 0.00]
> {noformat}
> h2. Conclusions
> It seems that any change introduced to these routines will hurt somebody in some configuration, so it's really hard
> for me to make choices. I would definitely opt for the precached {{pop2}} version on 32-bit systems as it seems to
> be always faster than or equally fast as the other bit-counting options. {{pop2}} looked like this:
> {code}
> private static byte [] bcounts = new byte [0x10000];
> static
> {
>     for (int i = 0x10000; --i >= 0;)
>         bcounts[i] = (byte) Integer.bitCount(i);
> }
>
> public static int pop2(long v)
> {
>     int t;
>     return
>           bcounts[(t = (int) v) & 0xffff]
>         + bcounts[t >>> 16]
>         + bcounts[(t = ((int) (v >>> 32))) & 0xffff]
>         + bcounts[t >>> 16];
> }
> {code}
> As for the hardware-accelerated {{ntz}}: if one can detect 1.7, then using the JDK's version speeds things up
> significantly. But I have not checked how this detection would affect speed if done at run-time (I assume a
> static final flag wouldn't cause any performance penalty), and it is definitely not worth replacing for folks
> with older VMs.
> h2. Raw results data
> I will attach raw results as part of the issue if you want to draw your own conclusions. I didn't have access to
> a SPARC machine or to any machine with the newest Intels.
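The run-time detection that the conclusions mention (pick the JDK intrinsic when the VM is new enough, fall back otherwise, dispatching on a static final flag so HotSpot can fold the branch away) could be sketched roughly as below. This is a hypothetical illustration, not Lucene's actual code: the class name `NtzSelector`, the flag `USE_JDK_NTZ` and the version-parsing logic are invented, and the fallback is a plain shift loop rather than BitUtil's real Hacker's Delight implementation.

```java
public class NtzSelector {
    // Hypothetical detection: parse the major version out of "java.version".
    // Versions of that era look like "1.6.0_17" or "1.7.0-ea", so the digit
    // after the first separator is the major version.
    static final boolean USE_JDK_NTZ;
    static {
        boolean jdk7 = false;
        try {
            String[] parts = System.getProperty("java.version", "1.6").split("[._-]");
            jdk7 = Integer.parseInt(parts[1]) >= 7;
        } catch (RuntimeException e) {
            jdk7 = false; // be conservative if the version string is unexpected
        }
        USE_JDK_NTZ = jdk7;
    }

    /** Portable fallback: count trailing zeros with a simple shift loop. */
    static int ntzFallback(long v) {
        if (v == 0) return 64;
        int n = 0;
        while ((v & 1L) == 0) {
            v >>>= 1;
            n++;
        }
        return n;
    }

    /** Dispatch on the constant flag; the JIT sees it as a compile-time constant. */
    public static int ntz(long v) {
        return USE_JDK_NTZ ? Long.numberOfTrailingZeros(v) : ntzFallback(v);
    }
}
```

Because the flag is a static final primitive initialized once, the conditional should cost nothing after JIT compilation, which is in line with the author's assumption that a static final flag carries no per-call penalty.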