hadoop-mapreduce-user mailing list archives

From Bejoy Ks <bejoy.had...@gmail.com>
Subject Re: Map Reduce Phase questions:
Date Sat, 17 Dec 2011 15:13:12 GMT
Adding to the earlier responses: map outputs are transferred to the
corresponding reducer over HTTP (which itself runs on top of TCP), not
over a custom raw TCP protocol.
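
The pull side of that transfer can be sketched roughly as below. This is only an illustration under stated assumptions: `MAP_OUTPUTS`, `fetch_segment` and `pull_partition` are hypothetical names, the in-memory dictionary stands in for the HTTP fetch from each map node, and `parallel_copies` mimics the idea behind the `mapred.reduce.parallel.copies` setting rather than Hadoop's actual copier code.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for finished map outputs: MAP_OUTPUTS[map_id][partition].
# In a real cluster each segment would be served over HTTP by the map's node.
MAP_OUTPUTS = {
    "map_0": {0: ["a", "c"], 1: ["b"]},
    "map_1": {0: ["e"], 1: ["d", "f"]},
    "map_2": {0: [], 1: ["h"]},
}


def fetch_segment(map_id, partition):
    """Stand-in for one HTTP GET of a single map's output for one partition."""
    return MAP_OUTPUTS[map_id][partition]


def pull_partition(partition, completed_maps, parallel_copies=2):
    """Reducer-side pull: at most `parallel_copies` fetches run at once."""
    with ThreadPoolExecutor(max_workers=parallel_copies) as pool:
        # pool.map yields results in the order of completed_maps.
        segments = pool.map(lambda m: fetch_segment(m, partition), completed_maps)
        return [record for segment in segments for record in segment]


if __name__ == "__main__":
    print(pull_partition(1, ["map_0", "map_1", "map_2"]))
```

Because the reducer decides how many fetches are in flight, the number of map nodes sending to it at any moment is bounded by its own setting.
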
The available hardware certainly determines the maximum number of tasks
that a node can handle; it depends on the number of cores, available
physical memory, etc. But that is not the only deciding factor. Other
factors include:
- what purposes your cluster is used for
- if you run HBase on the cluster, the hardware you need to set aside
for it
- the memory requirements of the typical jobs on your cluster

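As a rough illustration of how those factors interact, here is a small sizing sketch. The rule of thumb in it (take the smaller of a CPU-based and a memory-based limit after reserving memory for other daemons) is an assumption made for this example, not something Hadoop computes for you, and `estimate_slots` and all of its parameters are hypothetical. The resulting total would then be split by hand between `mapred.tasktracker.map.tasks.maximum` and `mapred.tasktracker.reduce.tasks.maximum` in the TaskTracker configuration.

```python
def estimate_slots(cores, ram_mb, reserved_mb, task_heap_mb, tasks_per_core=1.0):
    """Estimate total task slots for one node (illustrative heuristic only).

    reserved_mb  : memory set aside for the OS, DataNode, TaskTracker,
                   and anything else on the node (e.g. an HBase RegionServer).
    task_heap_mb : memory budget per task JVM for typical jobs.
    """
    usable_mb = max(ram_mb - reserved_mb, 0)
    by_memory = usable_mb // task_heap_mb   # how many task JVMs fit in RAM
    by_cpu = int(cores * tasks_per_core)    # how many tasks the cores can run
    return min(by_memory, by_cpu)


if __name__ == "__main__":
    # Same hardware, with and without extra memory reserved for HBase.
    print(estimate_slots(cores=8, ram_mb=16384, reserved_mb=6144, task_heap_mb=1024))
    print(estimate_slots(cores=8, ram_mb=16384, reserved_mb=12288, task_heap_mb=1024))
```

With these made-up numbers the first node is CPU-bound and the second is memory-bound, which is why identical hardware can end up with different slot counts depending on what else the cluster runs.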

On Sat, Dec 17, 2011 at 2:53 AM, Ann Pal <ann_r_pal@yahoo.com> wrote:

> Thanks a lot for your answers!
> For [1] With the "Pull" model chances of seeing a TCP-Incast problem where
> multiple map nodes send data to the same reduce node at the same time are
> minimal (since the reducer is responsible for retrieving data it can
> handle). Is this a valid assumption?
> For [3] what I meant was: how does the infrastructure decide how many
> map/reduce slots are present on a given node? Is it based on the capacity
> (memory/CPU) of the node?
> Thanks again in advance..
>   ------------------------------
> *From:* Harsh J <harsh@cloudera.com>
> *To:* mapreduce-user@hadoop.apache.org; Ann Pal <ann_r_pal@yahoo.com>
> *Sent:* Friday, December 16, 2011 6:22 AM
> *Subject:* Re: Map Reduce Phase questions:
> On Fri, Dec 16, 2011 at 7:03 PM, Ann Pal <ann_r_pal@yahoo.com> wrote:
> > Hi,
> > I had some questions specifically on the Map-Reduce phase:
> >
> > [1] For the reduce phase, the TaskTrackers corresponding to the reduce
> > node, poll the Job Tracker to know about maps that have completed and
> > if the Jobtracker informs it about maps that are complete, it then
> > pulls the data from the map node where the map is complete. This is a
> > "pull" model as opposed to "push" model where the map directly sends a
> > region of the map output to the appropriate reduce node. Is the pull
> > model the default in 0.20, 0.23 etc ?
> Yes, it is.
> > In the pull model, how does the Reduce node know it is responsible for a
> > particular region of map output? (Is this determined up front? From
> > where it gets this information?)
> The reducer ID is == the partition ID from the map side. Thereby,
> reducer 0 will pull 0th partitions, and so on.
> > [2]There can be multiple reduce tasks per reduce node. The number of
> > reduce tasks is configurable, How about the number of reduce nodes? How
> > is this determined?
> Reducers are assigned by the scheduler in use. They are assigned once
> per TT heartbeat, to have somewhat of an equal distribution today -
> but there is no way to induce locality of reducers from the
> user-program itself. So far I've not seen a case where this may be
> absolutely necessary to have, as the schedulers today are pretty
> capable of doing things right.
> > [3]Pre 0.23, The map/reduce tasks slots for a node are allocated
> > statically. Is this based on just configuration ?
> What do you mean by 'allocated statically' here? Are you talking about
> slot limit configurations?
> --
> Harsh J

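The "reducer ID == partition ID" rule quoted above can be sketched as follows. This is a minimal illustration, assuming keys hash like Java strings; the formula mirrors Hadoop's default `HashPartitioner`, `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`, while `java_string_hash`, `partition_for` and `bucket_map_output` are hypothetical helper names written for this example (real `Text` keys use a different hash function).

```python
def java_string_hash(s):
    """Java's String.hashCode for BMP characters, as a signed 32-bit int."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def partition_for(key, num_reducers):
    """Partition index for a key; reducer N pulls partition N from every map."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers


def bucket_map_output(records, num_reducers):
    """Map side: group (key, value) records into one bucket per reducer."""
    buckets = {r: [] for r in range(num_reducers)}
    for key, value in records:
        buckets[partition_for(key, num_reducers)].append((key, value))
    return buckets


if __name__ == "__main__":
    output = bucket_map_output([("a", 1), ("b", 1), ("a", 2)], num_reducers=4)
    for reducer_id, bucket in output.items():
        print(reducer_id, bucket)
```

Since every map applies the same function with the same reducer count, all values for a given key land in the same partition number on every map node, so a reducer only needs its own ID to know what to fetch.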