hadoop-common-user mailing list archives

From Amogh Vasekar <am...@yahoo-inc.com>
Subject RE: MR job scheduler
Date Fri, 21 Aug 2009 06:35:01 GMT
Yes, but the copy phase starts as soon as a reducer is initialized, after which it
keeps polling for completed map tasks so it can fetch their outputs.
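As a rough illustration of that polling loop, here is a toy standalone model (the class, method, and queue here are mine for illustration, not Hadoop's internal API):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Toy model of the copy phase: the reducer starts polling for finished
// maps as soon as it is initialized, instead of waiting for every map
// to complete before doing any work.
public class CopyPhaseSketch {
    public static List<String> shuffle(Queue<String> completedMaps, int totalMaps) {
        List<String> fetched = new ArrayList<>();
        while (fetched.size() < totalMaps) {
            String mapId = completedMaps.poll();   // poll for newly completed maps
            if (mapId != null) {
                fetched.add("output-of-" + mapId); // fetch that map's output immediately
            }
            // the real reducer would sleep and re-poll the JobTracker here
        }
        return fetched;                            // reduce() proper runs only after this
    }

    public static void main(String[] args) {
        Queue<String> done = new ArrayDeque<>(List.of("map-0", "map-1", "map-2"));
        System.out.println(shuffle(done, 3));
        // → [output-of-map-0, output-of-map-1, output-of-map-2]
    }
}
```

The point of the sketch: copying overlaps with the remaining maps, but the loop (and hence reduce()) cannot finish until all maps have reported in.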

-----Original Message-----
From: bharath vissapragada [mailto:bharathvissapragada1990@gmail.com] 
Sent: Friday, August 21, 2009 12:00 PM
To: common-user@hadoop.apache.org
Subject: Re: MR job scheduler

Amogh

I think the reduce phase starts only when all the map tasks are completed,
because it needs all the values corresponding to a particular key!

2009/8/21 Amogh Vasekar <amogh@yahoo-inc.com>

> I'm not sure that is the case with Hadoop. I think it assigns a reduce
> task to an available tasktracker at any instant, since a reducer polls the JT
> for completed maps. If it were as you said, a reducer wouldn't be
> initialized until all maps had completed, after which the copy phase would
> start.
>
> Thanks,
> Amogh
>
> -----Original Message-----
> From: bharath vissapragada [mailto:bharathvissapragada1990@gmail.com]
> Sent: Friday, August 21, 2009 9:50 AM
> To: common-user@hadoop.apache.org
> Subject: Re: MR job scheduler
>
> OK, I'll be a bit more specific.
>
> Suppose map outputs 100 different keys.
>
> Consider a key "K" whose corresponding values may be on N different datanodes,
> and a datanode "D" which has the maximum number of those values. Instead of
> moving the values on "D" to other systems, it would be useful to bring the
> values from the other datanodes to "D", to minimize both the data movement
> and the delay. The same applies to all the other keys. How does the
> scheduler take care of this?
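As far as I know, reduce placement isn't driven by value locality at all: the default HashPartitioner deterministically maps each key to a reducer by hashing, regardless of where the values physically live. A standalone sketch of that one-line formula (my re-implementation for illustration, not the Hadoop class itself):

```java
// Sketch of how Hadoop's default HashPartitioner assigns a key to a
// reducer: purely by hash, independent of data placement.
public class PartitionSketch {
    static int getPartition(String key, int numReduceTasks) {
        // mask off the sign bit so the modulo result is non-negative
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // every occurrence of "K", on any node, maps to the same reducer
        System.out.println(getPartition("K", 4));
        System.out.println(getPartition("K", 4)); // same partition, deterministic
    }
}
```

So the optimization described above (pulling values toward "D") isn't what happens; all values for "K" are shuffled to whichever tasktracker runs the reducer that "K" hashes to.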
> 2009/8/21 zjffdu <zjffdu@gmail.com>
>
> > Some more details:
> >
> > 1. #maps is determined by the block size and the InputFormat (whether
> > you want the input split or not).
> >
> > 2. The default scheduler for Hadoop is FIFO; the Fair Scheduler and the
> > Capacity Scheduler are the other two options as far as I know. The
> > JobTracker hosts the scheduler.
> >
> > 3. Once a map task is done, it tells its own tasktracker, and the
> > tasktracker tells the jobtracker; so the jobtracker manages all the tasks
> > and decides how and when to start the reduce tasks.
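A back-of-the-envelope sketch of point 1, showing how the map count falls out of the block size for a splittable input (the formula mirrors FileInputFormat's split-size computation as I remember it; variable names are mine):

```java
// Number of map tasks for a splittable file: one map per input split,
// where the split size is derived from the HDFS block size clamped by
// the configured min/max split sizes.
public class SplitCountSketch {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long fileSize  = 1024L * 1024 * 1024;          // 1 GB input file
        long blockSize = 64L * 1024 * 1024;            // 64 MB HDFS block (old default)
        long split = computeSplitSize(blockSize, 1, Long.MAX_VALUE);
        long numMaps = (fileSize + split - 1) / split; // ceiling division
        System.out.println(numMaps);                   // → 16 maps
    }
}
```

With an unsplittable InputFormat (e.g. compressed input that can't be split), it would instead be one map per file regardless of size.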
> >
> >
> >
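For reference on point 2, the scheduler is swapped in via configuration; a sketch of the mapred-site.xml entry for the Fair Scheduler, with the property and class name quoted from memory of the 0.20-era docs:

```xml
<!-- mapred-site.xml: replace the default FIFO scheduler on the JobTracker -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
```

The Capacity Scheduler would be configured the same way, pointing the property at its scheduler class instead.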
> > -----Original Message-----
> > From: Arun C Murthy [mailto:acm@yahoo-inc.com]
> > Sent: August 20, 2009 11:41
> > To: common-user@hadoop.apache.org
> > Subject: Re: MR job scheduler
> >
> >
> > On Aug 20, 2009, at 9:00 AM, bharath vissapragada wrote:
> >
> > > Hi all,
> > >
> > > Can anyone tell me how the MR scheduler schedules MR jobs?
> > > How does it decide where to create map tasks and how many to create?
> > > Once the map tasks are over, how does it decide to move the keys to the
> > > reducers efficiently (minimizing the data movement across the network)?
> > > Is there any doc available which describes this scheduling process in
> > > detail?
> > >
> >
> > The #maps is decided by the application. The scheduler decides where
> > to execute them.
> >
> > Once a map is done, the reduce tasks connect to the tasktracker (on
> > the node where the map task executed) and copy their share of its
> > output over HTTP.
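To make that fetch concrete: in the 0.20-era TaskTracker, each reducer pulled its partition of a finished map's output from a servlet on that map's tasktracker. A sketch of the URL it builds (servlet path and parameter names quoted from memory, so treat them as illustrative):

```java
// Illustrative construction of the HTTP fetch a reducer performs for
// one completed map task, against the tasktracker's map-output servlet.
public class FetchUrlSketch {
    static String mapOutputUrl(String ttHost, int httpPort,
                               String jobId, String mapTaskId, int reducePartition) {
        return "http://" + ttHost + ":" + httpPort + "/mapOutput"
             + "?job=" + jobId
             + "&map=" + mapTaskId
             + "&reduce=" + reducePartition;
    }

    public static void main(String[] args) {
        // reducer 2 fetching its partition of one completed map's output
        System.out.println(mapOutputUrl("node17", 50060,
                "job_200908210001_0001", "attempt_200908210001_0001_m_000003_0", 2));
    }
}
```

Note each reducer fetches only its own partition (selected by the `reduce` parameter), not the map's entire output.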
> >
> > Arun
> >
> >
>
