hadoop-common-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: strategies to share information between mapreduce tasks
Date Wed, 26 Sep 2012 19:11:53 GMT
Also read: http://arxiv.org/abs/1209.2191 ;-)

On Thu, Sep 27, 2012 at 12:24 AM, Bertrand Dechoux <dechouxb@gmail.com> wrote:
> I wouldn't be so surprised. It takes time, energy and money to solve
> problems and build solutions that are production-ready. Some people would
> consider the namenode/secondary SPOF a limit for Hadoop itself when it
> comes to critical production environments. (I am only quoting that view
> and don't want to start a discussion about it.)
>
> One paper that I heard about (but haven't had the time to read yet)
> might be related to your problem space:
> http://arxiv.org/abs/1110.4198
> But a research paper does not mean production-ready by tomorrow.
>
> http://research.google.com/archive/mapreduce.html is from 2004.
> and http://research.google.com/pubs/pub36632.html (dremel) is from 2010.
>
> Regards
>
> Bertrand
>
> On Wed, Sep 26, 2012 at 8:18 PM, Jane Wayne <jane.wayne2978@gmail.com> wrote:
>
>> jay,
>>
>> thanks. i just needed a sanity check. i hope and expect that one day,
>> hadoop will mature towards supporting a "shared-something" approach.
>> the web service call is not a bad idea at all. that way, we can
>> abstract what that ultimate data store really is.
>>
>> i'm just a little surprised that we are still in the same state with
>> hadoop with regard to this issue (there are probably higher priorities),
>> and that no research (that i know of) has come out of academia to
>> mitigate some of these limitations of hadoop (where has all the funding
>> for hadoop/mapreduce research gone, if this framework is the
>> fundamental building block of such a vast amount of knowledge mining
>> activity?).
>>
>> On Wed, Sep 26, 2012 at 12:40 PM, Jay Vyas <jayunit100@gmail.com> wrote:
>> > The reason this is so rare is that the nature of map/reduce tasks is that
>> > they are orthogonal, i.e. word count, batch image recognition, tera
>> > sort -- all the things hadoop is famous for are largely orthogonal tasks.
>> > It's much rarer (i think) to see people using hadoop to do traffic
>> > simulations or solve protein folding problems, because those tasks
>> > require continuous signal integration.
>> >
>> > 1) First, try to consider rewriting it so that all communication is
>> > replaced by state variables in a reducer, and choose your keys wisely,
>> > so that all "communication" between machines is obviated by the fact
>> > that a single reducer is receiving all the information relevant to its
>> > task.
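[The key-design idea in point 1 can be sketched outside Hadoop. The toy `map_phase`/`shuffle`/`reduce_phase` functions below are illustrative names, not Hadoop API; the point is that choosing the key routes every related record to a single reducer call, so no cross-task messaging is needed:]

```python
from collections import defaultdict

# Toy model of the MapReduce shuffle: records emitted with the same key
# are routed to a single reduce() call, so all "communication" between
# related records happens inside one reducer's local state.

def map_phase(records):
    # Emit (key, value) pairs; the key choice decides what ends up together.
    for user, amount in records:
        yield user, amount

def shuffle(pairs):
    # Group values by key, as Hadoop's shuffle/sort does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Each key's reducer holds all the state it needs locally.
    return {key: sum(values) for key, values in groups.items()}

records = [("alice", 3), ("bob", 1), ("alice", 4)]
totals = reduce_phase(shuffle(map_phase(records)))
print(totals)  # {'alice': 7, 'bob': 1}
```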
>> >
>> > 2) If a small amount of state needs to be preserved or cached in real
>> > time, to optimize the situation where two machines would otherwise redo
>> > the same task (i.e. invoke a web service to get a piece of data, or
>> > some other task that needs to be rate limited and not duplicated), then
>> > you can use a fast key-value store (like you suggested), such as the
>> > ones provided by Basho (http://basho.com/) or Amazon (Dynamo).
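[A minimal sketch of that "check the store before doing the work" pattern. A plain dict stands in for a networked store like Riak or Dynamo; a real client would need get/put calls plus error handling, which are omitted here:]

```python
# Cache-aside deduplication: before an expensive call (e.g. a rate-limited
# web service), check a shared key-value store; only compute on a miss.
# `store` is a plain dict standing in for a networked KV store.

store = {}
calls = 0

def expensive_lookup(key):
    global calls
    calls += 1           # count actual "web service" invocations
    return key.upper()   # placeholder for the real remote call

def get_or_compute(key):
    if key in store:          # another task already did this work
        return store[key]
    value = expensive_lookup(key)
    store[key] = value        # publish so other tasks can skip it
    return value

for k in ["a", "b", "a", "a"]:
    get_or_compute(k)

print(calls)  # 2 -- the duplicate keys were served from the store
```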
>> >
>> > 3) If you really need a lot of message passing, then you might be
>> > better off using an inherently more integrated tool like GridGain,
>> > which allows for sophisticated message passing between asynchronously
>> > running processes, i.e.
>> > http://gridgaintech.wordpress.com/2011/01/26/distributed-actors-in-gridgain/
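[The actor-style message passing described in point 3 can be sketched with threads and queues on a single machine; this is only a local stand-in for GridGain's distributed actors, which would add serialization and network transport:]

```python
import queue
import threading

# Actor-style message passing: one worker owns an inbox, processes
# messages asynchronously, and replies on a results queue.

inbox = queue.Queue()
results = queue.Queue()

def actor():
    # Process messages until a sentinel (None) arrives.
    while True:
        msg = inbox.get()
        if msg is None:
            break
        results.put(msg * 2)  # placeholder for real per-message work

worker = threading.Thread(target=actor)
worker.start()
for n in [1, 2, 3]:
    inbox.put(n)
inbox.put(None)   # sentinel: no more messages
worker.join()

doubled = [results.get() for _ in range(3)]
print(doubled)  # [2, 4, 6]
```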
>> >
>> >
>> > It seems like there might not be a reliable way to implement a
>> > sophisticated message-passing architecture in hadoop, because the
>> > system is inherently so dynamic, and is built for rapid streaming
>> > reads/writes, which would be stifled by significant communication
>> > overhead.
>>
>
>
>
> --
> Bertrand Dechoux



-- 
Harsh J
