hadoop-common-user mailing list archives

From "Paul Sutter" <sut...@gmail.com>
Subject Re: Task type priorities during scheduling ?
Date Fri, 21 Jul 2006 00:57:21 GMT
Perhaps we should consider separating the copy phase of the reducer
from the execution phase, and exempting the copy phase from the reduce
task limit?

This is a confusing issue, but more importantly, the file copy phase
uses few resources compared with the reduce phase itself
(thinking of the memory and CPU that go into sorting and the
reducer).
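
A minimal sketch of the idea above, assuming a toy model of a single tasktracker. The class and method names (TrackerSlots, start_copy, try_start_execution) are hypothetical and are not part of Hadoop; the point is only that reducers in the copy phase do not count against the reduce task limit, while the sort/reduce phase does.

```python
from dataclasses import dataclass, field


@dataclass
class TrackerSlots:
    """Toy model of one tasktracker; hypothetical, not Hadoop's scheduler."""
    reduce_limit: int = 2                       # plays the role of the reduce task limit
    copying: set = field(default_factory=set)   # reducers still fetching map output
    executing: set = field(default_factory=set) # reducers sorting and reducing

    def start_copy(self, task_id: str) -> None:
        # Under the proposal, the copy phase is exempt from the limit.
        self.copying.add(task_id)

    def try_start_execution(self, task_id: str) -> bool:
        # Only the sort/reduce phase counts against the limit.
        if task_id not in self.copying or len(self.executing) >= self.reduce_limit:
            return False
        self.copying.remove(task_id)
        self.executing.add(task_id)
        return True

    def finish(self, task_id: str) -> None:
        self.executing.discard(task_id)


tracker = TrackerSlots()
for task in ("r1", "r2", "r3"):
    tracker.start_copy(task)            # all three may copy at once

started = [tracker.try_start_execution(t) for t in ("r1", "r2", "r3")]
tracker.finish("r1")                    # frees one execution slot
started_after_finish = tracker.try_start_execution("r3")
```

With a limit of 2, the third reducer can copy alongside the others but must wait for an execution slot before it sorts and reduces.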



On 7/20/06, Yoram Arnon <yarnon@yahoo-inc.com> wrote:
> "mapred.tasktracker.tasks.maximum" does apply per task type.
>
> The reason reduce tasks launch from the get-go is that they collect the
> output from map tasks as soon as it's available. The observation is that the
> shuffle of the data from map tasks to reduce tasks over the network is often
> the number one bottleneck of the entire job, so starting it early and
> keeping the network saturated throughout job execution optimizes job
> execution time.
>
> In your case, ideally your 41 reducers will have almost all their input
> ready and waiting when the map tasks complete, and will immediately start
> sorting and reducing. More likely, the maps will complete faster than data
> can be shipped to the reducers, so the reducers will still wait for it, but
> for less time than if they had only just been launched, because data was
> being shipped to them throughout map execution.
>
> Yoram
>
> > -----Original Message-----
> > From: Kalbande, Manish [mailto:mkalbande@shopping.com]
> > Sent: Thursday, July 20, 2006 11:32 AM
> > To: hadoop-user@lucene.apache.org
> > Subject: Task type priorities during scheduling ?
> >
> > Hi,
> >
> > I am running a cluster of 21 nodes.
> > While running any job, I observed that reduce tasks are getting
> > scheduled long before all the map tasks have finished.
> > As a result, reduce tasks wait for map tasks to finish, and the
> > total time for map tasks is longer because they are not getting
> > scheduled quickly.
> >
> > It would be better if reduce tasks were scheduled only after
> > there are no map tasks left to be performed.
> >
> > For example, during the generate job, we had a total of 544 map tasks
> > and 41 reduce tasks.
> > All 41 reduce tasks got scheduled and only 42 map tasks could be
> > scheduled at a time.
> >
> > My current configuration:
> >
> > mapred.map.tasks = 83
> > mapred.reduce.tasks = 41
> > mapred.tasktracker.tasks.maximum = 2
> >
> > Also, does "mapred.tasktracker.tasks.maximum" apply per task type,
> > or is it for all tasks? From my observation it appears to be per
> > task type.
> >
> > thanks
> > Manish
> >
> >
>
>
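
The numbers quoted in the thread are consistent with a per-task-type limit, which a short calculation can illustrate. This is only a sketch: the node count, task counts, and limit come from the original message, while the term "slots" and the variable names are illustrative and not Hadoop identifiers.

```python
# Figures taken from the thread.
nodes = 21
tasks_maximum = 2            # mapred.tasktracker.tasks.maximum, applied per task type
total_map_tasks = 544
total_reduce_tasks = 41

# With a per-type limit, each node can run 2 maps AND 2 reduces at once.
map_slots = nodes * tasks_maximum
reduce_slots = nodes * tasks_maximum

concurrent_maps = min(total_map_tasks, map_slots)
concurrent_reduces = min(total_reduce_tasks, reduce_slots)

# Number of rounds needed to run all maps (ceiling division).
map_waves = -(-total_map_tasks // map_slots)

print(concurrent_maps, concurrent_reduces, map_waves)
```

This matches the observation of 42 concurrent map tasks with all 41 reduce tasks scheduled. If the limit applied to all tasks combined, the 41 reducers would leave room for only one map task at a time.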
