On 27/04/2007, at 00:14, Lucian Adrian Grijincu wrote:
> in aprconv10faster.patch you added:
>
> static const char digits[] = "0123456789";
> *--p = digits[magnitude % 10];
>
> Why is this faster than:
> *--p = (char) '0' + (magnitude % 10); ?

You have to take the entire loop into account. Compare the following:

    do {
        u_widest_int new_magnitude = magnitude / 10;
        *--p = (char) (magnitude - new_magnitude * 10 + '0');
        magnitude = new_magnitude;
    } while (magnitude);

against:

    do {
        *--p = digits[magnitude % 10];
    } while ((magnitude /= 10) != 0);

The digits table is easily cacheable, and the second loop does fewer
assignments.
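
For anyone who wants to try both loops outside of APR, here is a minimal
sketch. The function names conv_sub and conv_table, the uint64_t
magnitude type, and the convention of writing backwards from buf_end are
assumptions made for illustration; they are not taken from the patch or
from bench.c.

```c
#include <stdint.h>

/* Lookup table used by the table-based variant. */
static const char digits[] = "0123456789";

/* Variant 1: one division per digit, remainder recovered by
 * multiply-and-subtract. Writes backwards from buf_end and returns a
 * pointer to the first digit. Hypothetical name, not from APR. */
static char *conv_sub(uint64_t magnitude, char *buf_end)
{
    char *p = buf_end;

    *p = '\0';
    do {
        uint64_t new_magnitude = magnitude / 10;
        *--p = (char) (magnitude - new_magnitude * 10 + '0');
        magnitude = new_magnitude;
    } while (magnitude);
    return p;
}

/* Variant 2: modulo plus table lookup, then a separate division.
 * Hypothetical name, not from APR. */
static char *conv_table(uint64_t magnitude, char *buf_end)
{
    char *p = buf_end;

    *p = '\0';
    do {
        *--p = digits[magnitude % 10];
    } while ((magnitude /= 10) != 0);
    return p;
}
```

Both functions need a buffer of at least 21 bytes for a 64-bit value
(20 digits plus the terminator); pass a pointer to its last byte as
buf_end.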
>
> For your "faster" version, under the hood, the C compiler adds
> (magnitude % 10) to the address of digits and then copies the contents
> of the memory location represented by the sum's result into *p.
>
> My version just adds (magnitude % 10) to '0' and stores the result
> in *p.
Talk is cheap, let's benchmark! To see the generated assembly:
gcc -O2 -o bench bench.c -g
objdump -S bench > benchasm
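
bench.c itself is not included in this message. As a rough idea of how
such a comparison can be timed, here is a minimal sketch. It assumes the
standard clock() function instead of a cycle counter, so it reports
seconds rather than cycles, and the names conv_fn, conv_example and
time_conv are hypothetical.

```c
#include <stdint.h>
#include <time.h>

/* Signature shared by the conversion routines under test (assumed). */
typedef char *(*conv_fn)(uint64_t magnitude, char *buf_end);

/* Example routine to time: the table-lookup loop, writing backwards
 * from buf_end and returning a pointer to the first digit. */
static char *conv_example(uint64_t magnitude, char *buf_end)
{
    static const char digits[] = "0123456789";
    char *p = buf_end;

    *p = '\0';
    do {
        *--p = digits[magnitude % 10];
    } while ((magnitude /= 10) != 0);
    return p;
}

/* Volatile sink so the compiler cannot discard the conversions. */
static volatile unsigned char sink;

/* Run fn on `iterations` consecutive values starting at seed and
 * return the CPU time spent, in seconds. */
static double time_conv(conv_fn fn, uint64_t seed, long iterations)
{
    char buf[32];
    clock_t start = clock();
    long i;

    for (i = 0; i < iterations; i++) {
        char *s = fn(seed + (uint64_t) i, buf + sizeof buf - 1);
        sink = (unsigned char) *s;
    }
    return (double) (clock() - start) / CLOCKS_PER_SEC;
}
```

Call time_conv once per routine with the same seed and iteration count,
and repeat each measurement several times, since the first runs tend to
be noisier than the later ones.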
# Intel(R) Celeron(R) CPU 2.20GHz
[davi@montefiori ~]$ gcc -o bench bench.c -O2 # uint32_t
[davi@montefiori ~]$ ./bench $RANDOM
conv_1
cycles: 236
cycles: 236
cycles: 236
cycles: 236
cycles: 236
cycles: 236
cycles: 236
cycles: 236
conv_2
cycles: 236
cycles: 220
cycles: 224
cycles: 220
cycles: 224
cycles: 224
cycles: 224
cycles: 224
conv_1
cycles: 236
cycles: 236
cycles: 236
cycles: 236
cycles: 236
cycles: 236
cycles: 236
cycles: 236
conv_2
cycles: 220
cycles: 224
cycles: 224
cycles: 224
cycles: 224
cycles: 224
cycles: 224
cycles: 224
[davi@montefiori ~]$ gcc -o bench bench.c -O2 # uint64_t
[davi@montefiori ~]$ ./bench $RANDOM more
conv_1
cycles: 508
cycles: 532
cycles: 540
cycles: 468
cycles: 468
cycles: 468
cycles: 468
conv_2
cycles: 1188
cycles: 824
cycles: 896
cycles: 828
cycles: 824
cycles: 824
cycles: 820
conv_1
cycles: 524
cycles: 492
cycles: 468
cycles: 504
cycles: 468
cycles: 504
cycles: 468
conv_2
cycles: 768
cycles: 836
cycles: 836
cycles: 820
cycles: 820
cycles: 820
cycles: 820
> Am I missing something here?
After compiler optimizations, both versions yield similar results, but
the patch hurts the uint64_t (apr_uint64_t) case quite a bit. "Faster"
was an overstatement; I withdraw aprconv10faster.patch.

Davi Arnaut
