incubator-lucy-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marvin Humphrey <>
Subject Re: OO Design -- performance
Date Tue, 31 Oct 2006 04:24:43 GMT

On Oct 23, 2006, at 6:44 PM, David Balmain wrote:
>> The cost of a double-dereference to dispatch a method is
>> insignificant.  However, the syntax is godawful...
>>       len = instream->m->length_i(instream);
>> ... which is why Dave has macro-fied it:
>>       #define is_length(mis) mis->m->length_i(mis)
>> How about if we implement every method call in Lucy that way?
> Sounds good to me. I would have implemented all of Ferret's classes
> this way except that I was naively worrying about the performance
> detriment of the extra point dereference. I thought the extra speed
> may be worth the cost in memory.

I've pondered this a little more, and I think you have a point.  The  
extra deref is a cost in this design, and for a very tight loop, it  
could make a difference.  In general, pointer dereferencing is such a  
cheap operation it's not something worth worrying about.  But this is  
a major design decision.  We're not talking about a localized area of  
code.  This is everywhere, so it's important to choose correctly.

Maybe we can treat the i/o classes as a special case, since I think  
that's where the majority of our function calls occur.  Potentially,  
we could just have the Streams' macros alias functions directly...

     #define InStream_Write_VInt(instream) InStream_write_vint(instream)

... which is akin to declaring InStream_Write_VInt a "final" method.   
(I assume that at the compiler level, that's how "final" methods are  

That's an option if Lucy adopts the same architecture that Ferret and  
KinoSearch have: no subclasses for InStream and OutStream.

If the function calls are all macrofied, it's easy enough to switch  
up the macro definition and see if there's any effect.  My guess is  
that there will be a small but measurable difference.

This being C, we can also copy a function pointer out of the vtable  
and use it directly if we identify a really tight loop somewhere as a  

     InStream_read_vint_t read_vint = instream->_->read_vint;
     for (1 .. a_lot) {
         foo++ = read_vint(instream);

However, I suspect that if we treat the i/o classes as special cases  
and call their functions directly everywhere, we'll have addressed  
90% of the bottlenecks.

> Anyway, I changed the Streams to test
> the difference and I mean to refactor the rest of the classes like
> this when I have time.

Below my sig you'll find the output of a simple benchmarking app  
"speed.c" written by a guy named Michael Lee, run on my G4 Laptop and  
then also on a Pentium 4.  It shows the relative costs of some simple  
operations.  You can download it from < 
programming/>.  His original 1997 article  
on C optimization (which has a broken link to speed.c) is at <http://>.

Since that article is nearly a decade old, it doesn't spend a lot of  
time on CPU caching, which supposedly is becoming more important  
these days.  I don't have any practical experience doing those kinds  
of optimizations, though.  Maybe we could stuff all the vtables in  
one giant struct and hope that with all our function pointers jammed  
up against one another we'd get less cache thrash.  But I think  
that's chasing our tail.

The bottom line for me is that I'd prefer that the method calls be  
macrofied because it's going to clean up the source quite a bit.  The  
best way to optimize is to find a better algorithm, and simpler,  
cleaner, smaller source code makes refactoring easier.

Marvin Humphrey
Rectangular Research

PowerMac G4 1.67 Laptop

slothbear:~/Desktop marvin$ ./speed
clocks_per_sec = 100
worst case resolution = 500.0000 usec
precision = 0 decimal digits
    (cache & vm load)     0.100 usec   10.044 Mhz
      (loop overhead)     0.100 usec   10.047 Mhz
                empty     0.000 usec25391.792 Mhz
       /* comments */     0.000 usec21680.308 Mhz
              #define     0.000 usec23079.365 Mhz
          declaration     0.000 usec27676.983 Mhz
              array[]     0.003 usec  376.814 Mhz
             *pointer     0.003 usec  366.292 Mhz
                int =     0.003 usec  377.055 Mhz
         empty func()     0.010 usec  102.779 Mhz
            bit shift     0.002 usec  412.078 Mhz
         if-then-else     0.000 usec      n/a
            int + int     0.004 usec  280.621 Mhz
            int - int     0.001 usec 1450.834 Mhz
            int ^ int     0.001 usec 1361.626 Mhz
            int * int     0.004 usec  262.790 Mhz
            int / int     0.015 usec   65.456 Mhz
          (int) float     0.029 usec   33.933 Mhz
        float + float     0.006 usec  171.687 Mhz
        float * float     0.006 usec  158.656 Mhz
        float / float     0.015 usec   66.674 Mhz
             strcpy()     0.036 usec   28.010 Mhz
             strcmp()     0.000 usec 7761.078 Mhz
               rand()     0.032 usec   31.578 Mhz
               sqrt()     0.108 usec    9.296 Mhz
          malloc/free     0.392 usec    2.553 Mhz
         fopen/fclose     7.405 usec    0.135 Mhz
             system()   280.327 usec    0.004 Mhz
slothbear:~/Desktop marvin$

FreeBSD 5.3
Intel(R) Pentium(R) 4 CPU 3.20GHz

$ ./speed
clocks_per_sec = 128
worst case resolution = 390.6250 usec
precision = 0 decimal digits
    (cache & vm load)     0.121 usec    8.235 Mhz
      (loop overhead)     0.134 usec    7.455 Mhz
                empty     0.000 usec      n/a
       /* comments */     0.000 usec      n/a
              #define     0.000 usec      n/a
          declaration     0.000 usec      n/a
              array[]     0.001 usec 1203.097 Mhz
             *pointer     0.001 usec 1374.304 Mhz
                int =     0.001 usec 1153.964 Mhz
         empty func()     0.000 usec      n/a
            bit shift     0.000 usec      n/a
         if-then-else     0.000 usec      n/a
            int + int     0.000 usec      n/a
            int - int     0.001 usec 1279.043 Mhz
            int ^ int     0.001 usec 1052.546 Mhz
            int * int     0.001 usec 1102.379 Mhz
            int / int     0.008 usec  118.546 Mhz
          (int) float     0.001 usec 1141.288 Mhz
        float + float     0.001 usec 1173.412 Mhz
        float * float     0.001 usec 1343.009 Mhz
        float / float     0.006 usec  167.827 Mhz
             strcpy()     0.044 usec   22.755 Mhz
             strcmp()     0.000 usec      n/a
               rand()     0.020 usec   50.176 Mhz
               sqrt()     0.034 usec   29.739 Mhz
          malloc/free     0.218 usec    4.586 Mhz
         fopen/fclose     3.499 usec    0.286 Mhz
             system()   111.848 usec    0.009 Mhz

View raw message