commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "J.Pietschmann" <j3322...@yahoo.de>
Subject Re: [math] Complex dilemmas
Date Thu, 12 Jun 2003 18:34:23 GMT
Al Chou wrote:
> Cool reference, thanks.  Did you mean "study the source code"?

Yes, last time I looked it contained some entertaining remarks
in the comments, see
  http://www.mersenne.org/source.htm
 From memory, the story was like
1. Implement FFT straightforwardly. It's slow.
2. Leverage the CPU L1 cache: reshuffle the data so that the
   data access is localized despite the FFT access pattern, and
   break up loops so that half of the L1 cache can hold the
   source and the other half the result of the calculations
   in the loop. It's still slow.
3. Oops, there are the FFT constants which should fit in the L1
   cache as well. Cut down and rearrange loops. It's still slow.
4. Take L1 cache organisation in account. If certain bits of
   the starting address of some data snippets are the same, the
   cache will allocate only a few lines for storing them, then
   start over with the first allocated line. Naturally, if a large
   continous array is loaded this will cause thrashing, even
   though there are still free lines. Introduce gaps to offset
   starts of cache lines so that the whole L1 cache can be
   leveraged. It's *still* slow.
5. Perhaps doing some tricks with early flushing dirty cache
   lines and other stuff, I don't quite remember.
6. Take L2 cache organization into account. Further reshuffling
   and introducing gaps. Write macros to generate the assembler
   code, because nobody can keep track of all the permutations
   without going insane.
   Beyond a certain size, it is *still* slow.
7. Remember the processor has a TLB, which maps the logical address
   space of the process into the physical RAM address space. As
   usual, thrashing happened, causing frequent reloads of mapping
   data from the main RAM which is both expensive in itself and
   swapped valuable data out of the L2 cache. Introduce large gaps
   in the data to avoid this.
   This was the *only* time I ever read about a non-kernel programmer
   having to take this into account.
This was from 1998 or so, and somewhat specific for the Pentium
architecture. More recent processors have larger caches and slightly
different cache organizations, in particular a L2 cache on chip,
there are changes in write back handling, there may be a L3 cache,
RAM access has become more important, and processors aquired SIMD
instructions which are now used.
Note that professional numeric libraries have to deal with similar
problems when it comes to handling large matrices.

Have fun!

J.Pietschmann


---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org


Mime
View raw message