hadoop-mapreduce-user mailing list archives

From: Arun C Murthy <...@hortonworks.com>
Subject: Terasort
Date: Thu, 10 May 2012 22:49:38 GMT
Changing subject...

On May 10, 2012, at 3:40 PM, Jeffrey Buell wrote:

> I have the right #slots to fill up memory across the cluster, and all those slots are filled with tasks. The problem I ran into was that the maps grabbed all the slots initially and the reduces had a hard time getting started. As maps finished, more maps were started and only rarely was a reduce started. I assume this behavior occurred because I had ~4000 map tasks in the queue, but only ~100 reduce tasks. If the scheduler lumps maps and reduces together, then whenever a slot opens up it will almost surely be taken by a map task. To get good performance I need all reduce tasks started early on, and have only map tasks compete for open slots. Other apps may need different priorities between maps and reduces. In any case, I don’t understand how treating maps and reduces the same is workable.
>

Are you playing with YARN or MR1?

In any case, you are getting hit by 'slowstart' for reduces, wherein reduces aren't scheduled until a sufficient percentage of maps have completed.

Set mapred.reduce.slowstart.completed.maps to 0. (That should work for either MR1 or MR2).
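For example, on the terasort command line (the jar name and paths below are just placeholders for your install; on MR2 the new-style name for the same knob is mapreduce.job.reduce.slowstart.completedmaps):

  hadoop jar hadoop-examples.jar terasort \
    -Dmapred.reduce.slowstart.completed.maps=0.00 \
    /terasort/input /terasort/output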

Arun

> Jeff
>  
> From: Arun C Murthy [mailto:acm@hortonworks.com] 
> Sent: Thursday, May 10, 2012 1:27 PM
> To: mapreduce-user@hadoop.apache.org
> Subject: Re: max 1 mapper per node
>  
> For terasort you want to fill up your entire cluster with maps/reduces as fast as you can to get the best performance.
>  
> Just play with #slots.
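> On MR1 that means the per-TaskTracker slot counts, e.g. in mapred-site.xml (the values here are only illustrative; size them to your cores and memory):
>
>   <property>
>     <name>mapred.tasktracker.map.tasks.maximum</name>
>     <value>8</value>
>   </property>
>   <property>
>     <name>mapred.tasktracker.reduce.tasks.maximum</name>
>     <value>4</value>
>   </property>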
>  
> Arun
>  
> On May 9, 2012, at 12:36 PM, Jeffrey Buell wrote:
> 
> Not to speak for Radim, but what I’m trying to achieve is performance at least as good as 0.20 for all cases. That is, no regressions. For something as simple as terasort, I don’t think that is possible without being able to specify the max number of mappers/reducers per node. As it is, I see slowdowns of as much as 2X. Hopefully I’m wrong and somebody will straighten me out. But if I’m not, adding such a feature won’t lead to bad behavior of any kind, since the default could be set to unlimited and thus have no effect whatsoever.
>  
> I should emphasize that I support the goal of greater automation, since Hadoop has way too many parameters and is so hard to tune. Just not at the expense of performance regressions.
>  
> Jeff
>  
>  
> We've been against these 'features', since they lead to very bad behaviour across the cluster with multiple apps/users etc.
>  
> What is your use-case, i.e. what are you trying to achieve with this?
>  
> thanks,
> Arun
>  
> On May 3, 2012, at 5:59 AM, Radim Kolar wrote:
> 
> If a plugin system for the AM is overkill, something simpler could be done, like:
> 
> maximum number of mappers per node
> maximum number of reducers per node
> 
> maximum percentage of non data local tasks
> maximum percentage of rack local tasks
> 
> and set these in job properties.
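> e.g. (the property names are made up, purely to illustrate the idea):
>
>   mapreduce.job.max.maps.per.node=1
>   mapreduce.job.max.reduces.per.node=1
>   mapreduce.job.max.nonlocal.task.percent=10
>   mapreduce.job.max.racklocal.task.percent=30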
>  
> --
> Arun C. Murthy
> Hortonworks Inc.
> http://hortonworks.com/
> 
>  

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/


