hadoop-common-user mailing list archives

From Eric Baldeschwieler <eri...@yahoo-inc.com>
Subject Re: Task type priorities during scheduling ?
Date Fri, 21 Jul 2006 08:29:46 GMT
Of course, interleaving the sort with the copy phase would also be
desirable...

But I'm all for clearly IDing reduces vs shuffle.


On Jul 20, 2006, at 5:57 PM, Paul Sutter wrote:

> Perhaps we should consider separating the copy phase of the reducer
> from the execution phase, and exempt the copy phase from the reduce
> task limit?
>
> This is a confusing issue, but more importantly, the file copy phase
> uses few resources compared with the reduce phase itself (thinking
> of the memory and CPU that go into sorting and the reducer).
>
>
>
> On 7/20/06, Yoram Arnon <yarnon@yahoo-inc.com> wrote:
>> "mapred.tasktracker.tasks.maximum" does apply per task type.
>>
>> The reason reduce tasks launch from the get-go is that they collect
>> the output from map tasks as soon as it's available. The observation
>> is that the shuffle of data from map tasks to reduce tasks over the
>> network is often the number one bottleneck of the entire job, so
>> starting it early and keeping the network saturated throughout job
>> execution optimizes job execution time.
>>
>> In your case, ideally your 41 reducers will have almost all their
>> input ready and waiting when the map tasks complete, and will
>> immediately start sorting and reducing. More likely, the maps will
>> complete faster than data can be shipped to the reducers, so the
>> reducers will still wait for it, but for less time than if they had
>> only just been launched, because data was being shipped to them
>> throughout map execution.
>>
>> Yoram
>>
>> > -----Original Message-----
>> > From: Kalbande, Manish [mailto:mkalbande@shopping.com]
>> > Sent: Thursday, July 20, 2006 11:32 AM
>> > To: hadoop-user@lucene.apache.org
>> > Subject: Task type priorities during scheduling ?
>> >
>> > Hi,
>> >
>> > I am running a cluster of 21 nodes.
>> > While running any task, I observed that reduce tasks are getting
>> > scheduled long before all the map tasks have finished.
>> > As a result, reduce tasks wait for map tasks to finish, and the
>> > total time for map tasks is longer because they are not getting
>> > scheduled quickly.
>> >
>> > It would be better if reduce tasks were scheduled only after there
>> > are no map tasks left to be performed.
>> >
>> > For example, during the generate job, we had a total of 544 map
>> > tasks and 41 reduce tasks.
>> > All 41 reduce tasks got scheduled, and only 42 map tasks could be
>> > scheduled at a time.
>> >
>> > My current configuration:
>> >
>> > mapred.map.tasks=83
>> > mapred.reduce.tasks=41
>> > mapred.tasktracker.tasks.maximum=2
>> >
>> > Also, does "mapred.tasktracker.tasks.maximum" apply per task type,
>> > or is it for all tasks? From my observation it appears to be per
>> > task type.
>> >
>> > thanks
>> > Manish
>> >
>> >
>>
>>
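
The numbers in this thread can be checked with a little arithmetic. The
sketch below is only an illustration, not Hadoop code: the function names
are made up for this example, and it assumes (as Yoram confirms above)
that mapred.tasktracker.tasks.maximum is applied per task type. The
shuffle timings at the end are hypothetical, and the overlap model is
deliberately crude: it pretends the copy can run at full speed for the
whole map phase, so it shows the best case for launching reducers early.

```python
import math


def slots_per_type(nodes: int, max_tasks_per_tracker: int) -> int:
    """Cluster-wide slots for ONE task type, assuming a per-type limit."""
    return nodes * max_tasks_per_tracker


def map_waves(map_tasks: int, map_slots: int) -> int:
    """Number of rounds needed if every map slot runs one task per round."""
    return math.ceil(map_tasks / map_slots)


def reducer_wait_after_maps(shuffle_seconds: float,
                            map_phase_seconds: float,
                            early_start: bool) -> float:
    """Toy estimate of how long reducers still copy after the last map ends.

    Assumption: with an early start, the copy overlaps the entire map
    phase at full speed. Real jobs will overlap less than this.
    """
    if early_start:
        return max(0.0, shuffle_seconds - map_phase_seconds)
    return shuffle_seconds


# Values taken from the thread: 21 nodes, limit of 2, 544 maps, 41 reduces.
slots = slots_per_type(nodes=21, max_tasks_per_tracker=2)
print("slots per task type:", slots)               # 21 * 2 = 42
print("all reducers fit at once:", 41 <= slots)    # 41 <= 42
print("map waves:", map_waves(544, slots))         # ceil(544 / 42) = 13

# Hypothetical timings, chosen only to show the effect of overlapping.
print(reducer_wait_after_maps(900.0, 600.0, early_start=True))   # 300.0
print(reducer_wait_after_maps(900.0, 600.0, early_start=False))  # 900.0
```

This matches what Manish observed: 42 map slots explain why only 42 maps
ran at a time, and the 41 reducers occupy separate reduce slots rather
than taking slots away from the maps.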

