On 27/04/2007, at 00:14, Lucian Adrian Grijincu wrote:
> in apr-conv10-faster.patch you added:
>
> static const char digits[] = "0123456789";
> *--p = digits[magnitude % 10];
>
> Why is this faster than:
> *--p = (char) '0' + (magnitude % 10); ?
You have to take into account the entire loop. The fowling:
do {
u_widest_int new_magnitude = magnitude / 10;
*--p = (char) (magnitude - new_magnitude * 10 + '0');
magnitude = new_magnitude;
} while (magnitude);
against:
do {
*--p = digits[magnitude % 10];
} while ((magnitude /= 10) != 0);
digits is easily cacheable, fewer assignments.
>
> For your "faster" version, under the hood, the C compiler adds
> (magnitude % 10) to the address of digits and then copies the contents
> of the memory location represented by the sum's result into *--p.
>
> My version just adds (magnitude % 10) to '0' and stores the result
> in *--p.
Talk is cheap, let's benchmark! To see the generated assembly:
gcc -O2 -o bench bench.c -g
objdump -S bench > bench-asm
# Intel(R) Celeron(R) CPU 2.20GHz
[davi@montefiori ~]$ gcc -o bench bench.c -O2 # uint32_t
[davi@montefiori ~]$ ./bench $RANDOM
conv_1
cycles: 236
cycles: 236
cycles: 236
cycles: 236
cycles: 236
cycles: 236
cycles: 236
cycles: 236
conv_2
cycles: 236
cycles: 220
cycles: 224
cycles: 220
cycles: 224
cycles: 224
cycles: 224
cycles: 224
conv_1
cycles: 236
cycles: 236
cycles: 236
cycles: 236
cycles: 236
cycles: 236
cycles: 236
cycles: 236
conv_2
cycles: 220
cycles: 224
cycles: 224
cycles: 224
cycles: 224
cycles: 224
cycles: 224
cycles: 224
[davi@montefiori ~]$ gcc -o bench bench.c -O2 # uint64_t
[davi@montefiori ~]$ ./bench $RANDOM |more
conv_1
cycles: 508
cycles: 532
cycles: 540
cycles: 468
cycles: 468
cycles: 468
cycles: 468
conv_2
cycles: 1188
cycles: 824
cycles: 896
cycles: 828
cycles: 824
cycles: 824
cycles: 820
conv_1
cycles: 524
cycles: 492
cycles: 468
cycles: 504
cycles: 468
cycles: 504
cycles: 468
conv_2
cycles: 768
cycles: 836
cycles: 836
cycles: 820
cycles: 820
cycles: 820
cycles: 820
> Am I missing something here?
Both code, after compiler optimizations, yield similar results but
hurts uint64_t (apr_uint64_t) case quite a bit. "Faster" was a
overstatement, I withdraw apr-conv10-faster.patch.
--
Davi Arnaut