hadoop-common-user mailing list archives

From <dar...@ontrenet.com>
Subject Re: Why inter-rack communication in mapreduce slow?
Date Mon, 06 Jun 2011 15:00:05 GMT

IMO, that's right. That's because MapReduce/Hadoop was originally designed for
that kind of text-processing purpose (i.e. few stages, low dependency,
highly parallel).

It's when one tries to solve general-purpose algorithms of modest
complexity that MapReduce runs into I/O churning problems.
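To make the shuffle-volume point concrete, here is a minimal, hypothetical Python simulation (not Hadoop code; the function names are illustrative) of why a wordcount-style job generates little inter-rack traffic: a combiner collapses the per-token map output down to one pair per distinct word before anything crosses the rack switch.

```python
from collections import Counter

def map_phase(lines):
    # Map step of wordcount: emit a (word, 1) pair per token.
    return [(word, 1) for line in lines for word in line.split()]

def combine(pairs):
    # Combiner: locally sum counts per word before the shuffle,
    # so only one pair per distinct word crosses the network.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())

lines = ["the quick brown fox", "the lazy dog", "the fox"] * 1000
mapped = map_phase(lines)          # one pair per input token
combined = combine(mapped)         # one pair per distinct word

print(len(mapped), len(combined)) # 9000 6
```

The map output (9000 pairs here) shrinks to 6 pairs, which is why jobs like grep and wordcount rarely saturate an inter-rack link, whereas jobs with large, join-like intermediate output can.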

On Mon, 6 Jun 2011 23:58:53 +1000, elton sky <eltonsky9404@gmail.com> wrote:
> Hi John,
> Because for map tasks, the job tracker tries to assign them to local data
> nodes, so there's not much n/w traffic.
> Then the only potential issue will be, as you said, reducers, which pull
> data from all maps.
> So in other words, if the application only creates small intermediate
> output, e.g. grep, wordcount, this jam between racks is not likely to happen,
> is it?
> On Mon, Jun 6, 2011 at 11:40 PM, John Armstrong
> <john.armstrong@ccri.com>wrote:
>> On Mon, 06 Jun 2011 09:34:56 -0400, <darren@ontrenet.com> wrote:
>> > Yeah, that's a good point.
>> >
>> > I wonder though, what the load on the tracker nodes (port et al.)
>> > would be if an inter-rack fiber switch at 10's of GB/s is getting maxed.
>> >
>> > Seems to me that if there is that much traffic being moved across
>> > racks, the tracker node (or whatever node it is) would overload
>> > first?
>> It could happen, but I don't think it always would.  For example, the job
>> tracker is on rack A; it sees that the best place to put reducer R is on
>> rack B; it sees the reducer still needs a few hellabytes from mapper M on
>> rack C; it tells M to send its data to R; the switches on B and C get
>> throttled, leaving A free to handle other things.
>> In fact, it almost makes me wonder if an ideal setup is not only to put
>> each of the main control daemons on their own nodes, but to put THOSE
>> nodes on their own rack and keep all the data elsewhere.
