On 4/27/07, Davi Arnaut wrote: > On 27/04/2007, at 00:14, Lucian Adrian Grijincu wrote: > > > in apr-conv10-faster.patch you added: > > > > static const char digits[] = "0123456789"; > > *--p = digits[magnitude % 10]; > > > > Why is this faster than: > > *--p = (char) '0' + (magnitude % 10); ? > > You have to take into account the entire loop. The fowling: > > do { > u_widest_int new_magnitude = magnitude / 10; > *--p = (char) (magnitude - new_magnitude * 10 + '0'); > magnitude = new_magnitude; > } while (magnitude); > > against: > > do { > *--p = digits[magnitude % 10]; > } while ((magnitude /= 10) != 0); > > digits is easily cacheable, fewer assignments. > I wasn't promoting the first version, I wanted to say the second one could be made better this way: do { *--p = (char) '0' + (magnitude % 10); } while ((magnitude /= 10) != 0); I don't quite like u_widest_int from the first version from a portability standpoint, but I see that it's quite used in conv_10_quad() and the rest :( http://svn.apache.org/viewvc/apr/apr/trunk/strings/apr_snprintf.c?view=markup -- Lucian Adrian Grijincu > > > > For your "faster" version, under the hood, the C compiler adds > > (magnitude % 10) to the address of digits and then copies the contents > > of the memory location represented by the sum's result into *--p. > > > > My version just adds (magnitude % 10) to '0' and stores the result > > in *--p. > > Talk is cheap, let's benchmark! To see the generated assembly: > > gcc -O2 -o bench bench.c -g > objdump -S bench > bench-asm > > # Intel(R) Celeron(R) CPU 2.20GHz > [davi@montefiori ~]$ gcc -o bench bench.c -O2 # uint32_t > [davi@montefiori ~]$ ./bench $RANDOM > conv_1 > cycles: 236 > cycles: 236 > cycles: 236 > cycles: 236 > cycles: 236 > cycles: 236 > cycles: 236 > cycles: 236 > conv_2 > cycles: 236 > cycles: 220 > cycles: 224 > cycles: 220 > cycles: 224 > cycles: 224 > cycles: 224 > cycles: 224 > conv_1 > cycles: 236 > cycles: 236 > cycles: 236 > cycles: 236 > cycles: 236 > cycles: 236 > cycles: 236 > cycles: 236 > conv_2 > cycles: 220 > cycles: 224 > cycles: 224 > cycles: 224 > cycles: 224 > cycles: 224 > cycles: 224 > cycles: 224 > > [davi@montefiori ~]$ gcc -o bench bench.c -O2 # uint64_t > [davi@montefiori ~]$ ./bench $RANDOM |more > conv_1 > cycles: 508 > cycles: 532 > cycles: 540 > cycles: 468 > cycles: 468 > cycles: 468 > cycles: 468 > conv_2 > cycles: 1188 > cycles: 824 > cycles: 896 > cycles: 828 > cycles: 824 > cycles: 824 > cycles: 820 > conv_1 > cycles: 524 > cycles: 492 > cycles: 468 > cycles: 504 > cycles: 468 > cycles: 504 > cycles: 468 > conv_2 > cycles: 768 > cycles: 836 > cycles: 836 > cycles: 820 > cycles: 820 > cycles: 820 > cycles: 820 > > > > Am I missing something here? > > Both code, after compiler optimizations, yield similar results but > hurts uint64_t (apr_uint64_t) case quite a bit. "Faster" was a > overstatement, I withdraw apr-conv10-faster.patch. > > -- > Davi Arnaut > > >