Ted,
> MPI supports node-to-node communications in ways that map-reduce does not,
> however, which requires that you iterate map-reduce steps for many
> algorithms. With Hadoop's current implementation, this is horrendously
> slow (minimum 20-30 seconds per iteration).
>
> Sometimes you can avoid this by clever tricks. For instance, random
> projection can compute the key step in an SVD decomposition with one
> map-reduce while the comparable Lanczos algorithm requires more than one
> step per eigenvector (and we often want 100 of them!).
>
> Sometimes, however, there are no known algorithms that avoid the need for
> repeated communication. For these problems, Hadoop as it stands may be a
> poor fit. Help is on the way, however, with the MapReduce 2.0 work because
> that will allow much more flexible models of computation.
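The random-projection step mentioned above can be sketched in plain NumPy. This is only a local stand-in: in a real deployment the two products with the data matrix would be the parallel map-reduce passes, and `randomized_svd` is an illustrative name of my own, not a Hadoop or Mahout API.

```python
import numpy as np

def randomized_svd(A, k, oversample=10, seed=0):
    """Sketch of a truncated SVD via random projection.

    The multiplications with A are the data-parallel steps; the QR and
    the small SVD operate on tiny matrices and run locally.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    omega = rng.standard_normal((n, k + oversample))  # random test matrix
    Y = A @ omega               # pass 1 over the data (the map-reduce step)
    Q, _ = np.linalg.qr(Y)      # small, local orthonormalization
    B = Q.T @ A                 # pass 2 over the data
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    return U[:, :k], s[:k], Vt[:k]
```

The point is that the expensive work is two streaming passes over A, rather than one (or more) passes per eigenvector as with Lanczos.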
For applications requiring iterative computation (regression, for example),
there is an extension to Hadoop called HaLoop. HaLoop takes advantage of the
invariant part of the input: it caches it on the reducers' local disks to
avoid repeated computation against the same data. Another system, Twister,
uses long-running map and reduce tasks, and has each map task handle the
same part of the invariant input in every iteration.
Neither of them uses interprocess communication; the main performance
benefit of both comes from caching the invariant input.
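As a toy illustration of why caching the invariant input matters, here is an iterative driver loop in plain Python, PageRank-style. The function names are hypothetical, not HaLoop or Twister APIs; the point is that `links` never changes across iterations, so plain Hadoop rereading it from HDFS every round is pure overhead.

```python
def pagerank_iteration(links, ranks, damping=0.85):
    """One map-reduce-like step: distribute rank mass along links.

    `links` (page -> outgoing pages) is the invariant input; only
    `ranks` changes between iterations.
    """
    n = len(links)
    new = {page: (1 - damping) / n for page in links}
    for page, outs in links.items():        # "map": emit contributions
        if outs:
            share = ranks[page] / len(outs)
            for dest in outs:               # "reduce": sum per destination
                new[dest] = new.get(dest, 0.0) + damping * share
    return new

def iterate(links, iterations=20):
    """Driver loop: each round is a full map-reduce job in plain Hadoop,
    while HaLoop/Twister keep `links` resident between rounds."""
    ranks = {page: 1.0 / len(links) for page in links}
    for _ in range(iterations):
        ranks = pagerank_iteration(links, ranks)
    return ranks
```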
> Some machine learning algorithms require features that are much smaller than
> the original input. This leads to exactly the pattern you describe.
> Integrating MPI with map-reduce is currently difficult and/or very ugly,
> however. Not impossible and there are hackish ways to do the job, but they
> are hacks.
As I am not familiar with applications in machine learning, can you give
specific examples I can look into? For opportunities to integrate message
passing, I'm looking for:
Applications with big data and complex computation, where the input data can
first be prepared by map-reduce and then a message-passing model is a better
fit for the computation, or vice versa. There may be multiple such steps,
forming a workflow.
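One concrete shape such a workflow could take, sketched in plain Python with k-means as the iterative stage (all names here are illustrative, not an existing framework's API): the assignment step is a natural map, the centroid update is a reduce, and the per-iteration exchange of centroids is exactly the part a message-passing model (e.g. an allreduce of partial sums) would replace.

```python
def map_assign(points, centroids):
    """Map phase: bucket each point under its nearest centroid."""
    buckets = {i: [] for i in range(len(centroids))}
    for p in points:
        i = min(range(len(centroids)),
                key=lambda i: (p - centroids[i]) ** 2)
        buckets[i].append(p)
    return buckets

def reduce_update(buckets, centroids):
    """Reduce phase: new centroid = mean of its bucket. With MPI this
    would be an allreduce of partial sums instead of a full job."""
    return [sum(b) / len(b) if b else centroids[i]
            for i, b in sorted(buckets.items())]

def kmeans(points, centroids, iterations=10):
    """Driver: each round is one map-reduce job in plain Hadoop."""
    for _ in range(iterations):
        centroids = reduce_update(map_assign(points, centroids), centroids)
    return centroids
```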
Elton
