Without outrageous magic FFT-style algorithms, an n x n times n x n matrix multiply
requires k_1 n^3 flops and at least k_2 n^2 I/O operations. If you can
arrange to reuse every operand a LOT, then you only have to do about 2 n^2 IOPS,
but if you fail at that, then I/O tends toward n^3 as well. If you do
manage to get sufficient reuse, then the fact that FLOPS are faster than
IOPS lets us be CPU-bound. On a single machine, IOPS are pretty fast, so
you can saturate the CPU much more easily than in a distributed setting.
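
A quick back-of-the-envelope sketch of that arithmetic (the constant k_1 = 2, the
3 n^2 word traffic, and the helper name are my assumptions, not from the thread):

```python
# Flops vs. I/O for an n x n dense matrix multiply, assuming
# k_1 = 2 (one multiply + one add per inner-product term) and
# perfect operand reuse (each matrix moves through memory once).
def multiply_costs(n):
    flops = 2 * n ** 3          # k_1 * n^3
    words_moved = 3 * n ** 2    # read A, read B, write C: k_2 * n^2
    return flops, words_moved, flops / words_moved  # flops per word

for n in (1_000, 40_000):
    flops, words, reuse = multiply_costs(n)
    print(f"n={n}: {flops:.2e} flops, {words:.2e} words, {reuse:.0f} flops/word")
```

With perfect reuse the flops-per-word ratio grows linearly with n, which is why a
large dense multiply can be CPU-bound on one machine; without reuse that ratio
collapses toward a small constant and I/O dominates.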
For map-reduce on multiple machines, the IOPS/FLOPS ratio becomes evil and
keeping the CPU loaded becomes much harder. In general, it is hard to
transfer even 50 MB/s into a single mapper, which is only about 5 million
floating point numbers per second. Even at a generous 40 flops per number, this
is only about 200 MFlops per CPU (with several cores!!). Well-tuned dense
multiply routines should crank out about 2-3 GFlops per core. Thus, for an
8-core machine, we have a big deficiency to overcome before Hadoop-style
map-reduce is practical for this.
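
Plugging in the numbers from that paragraph (the 8-byte doubles, the 40 flops per
number, and the ~2.5 GFlops/core peak are my assumed figures for the sketch):

```python
# Rough bandwidth-limited throughput for a single mapper.
bytes_per_sec = 50e6                        # ~50 MB/s into one mapper
doubles_per_sec = bytes_per_sec / 8         # 8-byte doubles -> ~6M numbers/s
flops_per_number = 40                       # useful work per transferred number
io_bound_flops = doubles_per_sec * flops_per_number   # ~250 MFlops

peak_per_core = 2.5e9                       # well-tuned dense multiply per core
machine_peak = 8 * peak_per_core            # 8-core box -> ~20 GFlops

print(f"I/O-bound: {io_bound_flops / 1e6:.0f} MFlops, "
      f"machine peak: {machine_peak / 1e9:.0f} GFlops, "
      f"gap: {machine_peak / io_bound_flops:.0f}x")
```

The roughly two-orders-of-magnitude gap between what the network can feed and
what the cores can chew is the deficiency in question.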
For lots of algorithms, this gets even worse. The convergence rate of
online gradient descent suffers badly with any kind of delay between
computation of a gradient and its application to the model, and batching is
a kind of delay. Moreover, a single CPU is often entirely sufficient to
saturate the I/O devices. This means that parallelizing the algorithm gives
almost no wall-clock convergence speedup at all, even though all the CPUs
might be very, very busy.
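
A toy illustration of why delay hurts (the one-dimensional quadratic and all the
constants here are mine, not from the thread): with the same number of gradient
evaluations, applying each gradient immediately gets much closer to the optimum
than computing gradients against a stale model and folding them in per batch.

```python
# Minimize f(w) = (w - 3)^2 with gradient g(w) = 2 * (w - 3).
def sgd(steps, batch, lr=0.1, w=0.0, target=3.0):
    """Evaluate `steps` gradients total; each batch of `batch` gradients
    is computed at the same stale w, so its average equals one stale
    gradient, applied as a single model update."""
    for _ in range(steps // batch):
        g = 2 * (w - target)   # stale gradient shared by the whole batch
        w -= lr * g            # one model update per batch
    return w

immediate = sgd(steps=50, batch=1)   # update after every gradient
delayed = sgd(steps=50, batch=10)    # model only moves every 10 gradients

print(abs(immediate - 3.0), abs(delayed - 3.0))
```

Batching per se isn't the crime; the crime is that the gradients were computed
against a stale model, which is exactly the delay that distributed execution
adds on top.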
On Sun, Apr 11, 2010 at 9:27 PM, Steven Buss <steven.buss@gmail.com> wrote:
> If you're just doing matrix multiplication, I would advise that Mahout
> (or any map-reduce approach) isn't well suited to your problem. I did
> the same computation with MATLAB (multiplying two 40K x 40K random
> double-precision dense matrices) using 14 cores and about 36GB of RAM
> on a single machine* and it finished in about 55 minutes. If I'm
> reading your email correctly, you were working with 34*2*4=272 cores!
> I'm not sure whether dense matrix multiplication can actually be
> efficiently map-reduced, but I am still a rookie, so don't take my word
> for it.
>
> *The machine I am working on has 8 dual-core AMD Opteron 875s @ 2.2GHz
> per core, with 64GB total system memory.
>
> Steven Buss
> steven.buss@gmail.com
> http://www.stevenbuss.com/
>
>
>
> On Sun, Apr 11, 2010 at 11:53 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> > Vimal,
> >
> > We don't have any distributed dense multiplication operations because we
> > have not yet found much application demand for distributed dense matrix
> > multiplication. Distributed sparse matrix operations are a big deal,
> > however.
> >
> > If you are interested in working on the problem in the context of Mahout, we
> > would love to help. This is especially true if you have an application that
> > needs dense operations and could benefit from some of the other capabilities
> > in Mahout.
> >
> > On Sun, Apr 11, 2010 at 1:27 PM, Vimal Mathew <vml.mathew@gmail.com> wrote:
> >
> >> Hi,
> >> What's the current state of matrix-matrix multiplication in Mahout?
> >> Are there any performance results available for large matrices?
> >>
> >> I have been working on a Hadoop-compatible distributed storage for
> >> matrices. I can currently multiply two 40K x 40K dense double-
> >> precision matrices in around 1 hour using 34 systems (16GB RAM, two
> >> Core2 Quads per node). I was wondering how this compares with Mahout.
> >>
> >> Regards,
> >> Vimal
> >>
> >
>
