hadoop-common-user mailing list archives

From Jane Wayne <jane.wayne2...@gmail.com>
Subject Re: strategies to share information between mapreduce tasks
Date Wed, 26 Sep 2012 18:18:26 GMT

thanks. i just needed a sanity check. i hope and expect that one day,
hadoop will mature towards supporting a "shared-something" approach.
the web service call is not a bad idea at all. that way, we can
abstract what that ultimate data store really is.

i'm just a little surprised that we are still in the same state with
hadoop in regards to this issue (there are probably higher priorities)
and that no research (that i know of) has come out of academia to
mitigate some of these limitations of hadoop (where's all the funding
for hadoop/mapreduce research gone if this framework is the
fundamental building block of a vast amount of knowledge mining?)
On Wed, Sep 26, 2012 at 12:40 PM, Jay Vyas <jayunit100@gmail.com> wrote:
> The reason this is so rare is that the nature of map/reduce tasks is that
> they are orthogonal, i.e. word count, batch image recognition, tera
> sort -- all the things hadoop is famous for are largely orthogonal tasks.
> It's much rarer (i think) to see people using hadoop to do traffic
> simulations or solve protein folding problems... because those tasks
> require continuous signal integration.
> 1) First, try to consider rewriting it so that all communication is replaced
> by state variables in a reducer, and choose your keys wisely, so that all
> "communication" between machines is obviated by the fact that a single
> reducer is receiving all the information relevant for it to do its task.
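to illustrate point 1: a minimal single-process sketch of the shuffle that makes this work, assuming a word-count-style job (the function names and sample records here are illustrative, not part of any Hadoop API). the grouping step is what replaces explicit communication -- all values for a key land on one reducer.

```python
from collections import defaultdict

def map_phase(records):
    # emit (key, value) pairs; the key choice decides which reducer sees what
    for line in records:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # the framework groups all values for a key onto a single reducer --
    # this grouping is what stands in for cross-machine communication
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(key, values):
    # all state for `key` is local to this one reducer call
    return key, sum(values)

records = ["the quick fox", "the lazy dog"]
result = dict(reduce_phase(k, vs) for k, vs in shuffle(map_phase(records)).items())
```

because "the" appears in both records, its two counts meet in one reducer call with no message passing between the map tasks that emitted them.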
> 2) If a small amount of state needs to be preserved or cached in real time
> to optimize the situation where two machines don't have to redo the
> same task (i.e. invoke a web service to get a piece of data, or some other
> task that needs to be rate limited and not duplicated), then you can use a
> fast key value store (like you suggested) like the ones provided by basho (
> http://basho.com/) or amazon (Dynamo).
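the check-then-fetch pattern in point 2 can be sketched like this; a plain dict stands in for the external key-value store (Riak, Dynamo, etc.), and `fetch_remote` is a hypothetical expensive call, not a real service API.

```python
calls = 0   # counts how many expensive fetches actually happen
cache = {}  # stands in for an external key-value store shared by all tasks

def fetch_remote(key):
    # hypothetical expensive, rate-limited call (e.g. a web service)
    global calls
    calls += 1
    return key.upper()

def get_or_fetch(key):
    # every task checks the shared store first, so the expensive fetch
    # happens at most once per key across the whole cluster
    if key not in cache:
        cache[key] = fetch_remote(key)
    return cache[key]

first = get_or_fetch("user:42")
second = get_or_fetch("user:42")
```

in a real cluster the check-and-set would need to be atomic in the store itself (or the duplicate work tolerated), since two tasks can race on the same key.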
> 3) If you really need a lot of message passing, then you might be
> better off using an inherently more integrated tool like GridGain... which
> allows for sophisticated message passing between asynchronously running
> processes, i.e.
> http://gridgaintech.wordpress.com/2011/01/26/distributed-actors-in-gridgain/.
> It seems like there might not be a reliable way to implement a
> sophisticated message passing architecture in hadoop, because the system is
> inherently so dynamic, and is built for rapid streaming reads/writes, which
> would be stifled by significant communication overhead.
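for a feel of the actor style point 3 refers to, here is a toy single-process sketch: an actor is just a thread draining its own mailbox. this is only an illustration of the messaging model, not GridGain's actual API.

```python
import threading
import queue

class Actor:
    # a minimal actor: one thread, one mailbox, messages handled in order
    def __init__(self, handler):
        self.mailbox = queue.Queue()
        self.handler = handler
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            msg = self.mailbox.get()
            if msg is None:  # sentinel to stop the actor
                break
            self.handler(msg)

    def send(self, msg):
        # asynchronous: the sender never blocks on the handler
        self.mailbox.put(msg)

results = []
done = threading.Event()

def handle(msg):
    results.append(msg * 2)
    if len(results) == 3:
        done.set()

a = Actor(handle)
for i in range(3):
    a.send(i)
done.wait(timeout=2)  # wait until all three messages are processed
```

the point of the contrast with hadoop: here state lives inside a long-running process and messages flow continuously, whereas mapreduce tasks are short-lived and only exchange data through the shuffle.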
