hadoop-common-user mailing list archives

From elton sky <eltonsky9...@gmail.com>
Subject Re: questions about hadoop map reduce and compute intensive related applications
Date Sun, 01 May 2011 03:13:47 GMT

MPI supports node-to-node communications in ways that map-reduce does not,
> however, which requires that you iterate map-reduce steps for many
> algorithms.   With Hadoop's current implementation, this is horrendously
> slow (minimum 20-30 seconds per iteration).
> Sometimes you can avoid this by clever tricks.  For instance, random
> projection can compute the key step in an SVD decomposition with one
> map-reduce while the comparable Lanczos algorithm requires more than one
> step per eigenvector (and we often want 100 of them!).
> Sometimes, however, there are no known algorithms that avoid the need for
> repeated communication.  For these problems, Hadoop as it stands may be a
> poor fit.  Help is on the way, however, with the MapReduce 2.0 work because
> that will allow much more flexible models of computation.
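To make the quoted point about iteration cost concrete, here is a toy in-memory sketch of the iterative map-reduce pattern (plain Python; none of these names are Hadoop's actual API). On a real cluster, each turn of the loop below would be a separate job submission, which is where the 20-30 seconds per iteration go:

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """One map-reduce pass: map every record, group by key, reduce each group."""
    groups = defaultdict(list)
    for rec in records:
        for key, value in map_fn(rec):
            groups[key].append(value)
    return {key: reduce_fn(key, vals) for key, vals in groups.items()}

# Toy iterative computation: push each node's value to its neighbours
# (a stand-in for a PageRank-style iteration).
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
values = {"a": 1.0, "b": 0.0, "c": 0.0}

def spread(item):
    node, val = item
    share = val / len(graph[node])
    return [(nbr, share) for nbr in graph[node]]

for _ in range(10):  # on Hadoop, every turn of this loop is a fresh job
    result = run_mapreduce(values.items(), spread, lambda k, vals: sum(vals))
    values = {node: result.get(node, 0.0) for node in graph}
```

Algorithms that need 100 such iterations (like the Lanczos example above) pay that job-startup cost 100 times.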

For applications requiring iterative computation, there is an extension to
Hadoop called HaLoop. HaLoop takes advantage of the loop-invariant part of the
input: it stores it on the local disks of the reducers to avoid repeated
processing of the same data. Another system, Twister, uses long-running map
and reduce tasks, and has each map handle the same part of the invariant input
in every iteration. Neither of them uses interprocess communication; the main
performance benefit of both comes from caching the invariant input.
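A rough sketch of where that caching benefit comes from (the counters and names below are mine, not HaLoop's or Twister's API): the loop-invariant input is read once and reused, instead of being re-scanned from HDFS on every iteration:

```python
reads = {"plain": 0, "cached": 0}

links = {"a": ["b"], "b": ["a"]}          # loop-invariant input
def load_links(mode):
    reads[mode] += 1                      # stand-in for a full HDFS scan
    return links

ITERATIONS = 5

# Plain Hadoop: every iteration re-reads the invariant data.
ranks = {"a": 1.0, "b": 1.0}
for _ in range(ITERATIONS):
    g = load_links("plain")
    ranks = {n: sum(ranks[m] / len(g[m]) for m in g if n in g[m]) for n in g}

# HaLoop-style: read once, keep it local (here: a variable; in HaLoop,
# the reducer's local disk).
ranks = {"a": 1.0, "b": 1.0}
cache = load_links("cached")
for _ in range(ITERATIONS):
    ranks = {n: sum(ranks[m] / len(cache[m]) for m in cache if n in cache[m])
             for n in cache}
```

Five iterations cost five scans of the invariant input in the plain version and one in the cached version.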

Some machine learning algorithms require features that are much smaller than
> the original input.  This leads to exactly the pattern you describe.
>  Integrating MPI with map-reduce is currently difficult and/or very ugly,
> however.  Not impossible and there are hackish ways to do the job, but they
> are hacks.
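For the "features much smaller than the original input" pattern the quote mentions, here is a hedged toy sketch (illustrative data and names; no real Hadoop or MPI involved): one map-reduce-style pass shrinks the input to a small summary, after which the iterative or message-passing stage only needs the small data and no longer needs Hadoop:

```python
from collections import Counter

# Stage 1: a map-reduce-style pass reduces a large corpus to small
# per-word counts (the extracted "features").
corpus = ["spam spam ham", "ham eggs", "spam eggs eggs"]
counts = Counter(word for line in corpus for word in line.split())

# Stage 2: compute on the small summary. With MPI, this is where ranks
# would exchange partial results; here it is a trivial normalization.
total = sum(counts.values())
freqs = {word: c / total for word, c in counts.items()}
```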

As I am not familiar with machine learning applications, can you give
specific examples I can look into? For opportunities to integrate message
passing, I'm looking for apps with big data and complex computation, where
the input data can first be manipulated by map-reduce and then a
message-passing model is better suited for the computation, or vice versa.
It may have multiple steps which build a

