hadoop-common-dev mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: Estimating number of worker nodes
Date Tue, 14 Feb 2006 17:13:24 GMT
Michel Tourn wrote:
> but at other times an MR application would rather
> set a relative number of tasks:
>  num Map Tasks = 5 * num active worker nodes
>  num Reduce Tasks = 2 * num active worker nodes
> 
> Intuitively, it seems to be a good thing to consider
> num-active-worker-nodes as a variable rather than
> as a constant known by the MapRed user.

The current assumption is that the default for mapred.map.tasks and 
mapred.reduce.tasks is configured for the cluster, based on the number 
of nodes and the number of cpus per node.  For classic MapReduce 
operations around five times the number of CPUs (not nodes) seems to be 
a good number of map tasks, and around one times the number of CPUs 
seems to be a good number of reduce tasks.
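
To make that rule of thumb concrete, here is a rough sketch (the class name, the 
20-node / 2-CPU cluster size, and setting the values per job via JobConf are mine 
for illustration, not something from this thread):

    // Hypothetical sizing for a 20-node cluster with 2 CPUs per node (40 CPUs).
    import org.apache.hadoop.mapred.JobConf;

    public class TaskCountExample {
      public static void main(String[] args) {
        int numCpus = 20 * 2;                 // nodes * cpus per node = 40
        JobConf conf = new JobConf();
        conf.setNumMapTasks(5 * numCpus);     // mapred.map.tasks = 200
        conf.setNumReduceTasks(1 * numCpus);  // mapred.reduce.tasks = 40
      }
    }

In practice the same numbers would normally be set once in the cluster-wide 
configuration rather than per job, as described above.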

It remains to be seen whether there are jobs which benefit greatly from 
varying these.  In my experience, a single setting seems to work well 
for a wide variety of jobs.  I'd rather not introduce a mechanism until 
we can show that it is needed.  Do you have jobs that perform poorly 
with settings like these?

A better way to specify this might be to have parameters 
mapred.num.cpus and mapred.maps.per.cpu and mapred.reduces.per.cpu, the 
latter two with default values of 5 and 1 respectively.  Then it would 
be simpler to configure a cluster, by just specifying the total number 
of cpus.  (Note that mapred.tasktracker.tasks.maximum, the number of map 
or reduce tasks to run simultaneously on a node, is also a key 
cluster-specific parameter, typically set to the number of cpus per node.)
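
Purely as an illustration of that proposal (mapred.num.cpus, mapred.maps.per.cpu 
and mapred.reduces.per.cpu are only names suggested above, not existing keys), the 
derived defaults could be computed along these lines:

    // Sketch of deriving task counts from the proposed cluster-level parameters.
    import org.apache.hadoop.mapred.JobConf;

    public class ProposedDefaults {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        int cpus = conf.getInt("mapred.num.cpus", 1);              // total cpus in the cluster
        int mapsPerCpu = conf.getInt("mapred.maps.per.cpu", 5);    // proposed default: 5
        int reducesPerCpu = conf.getInt("mapred.reduces.per.cpu", 1); // proposed default: 1
        conf.setNumMapTasks(cpus * mapsPerCpu);
        conf.setNumReduceTasks(cpus * reducesPerCpu);
      }
    }

Configuring a cluster would then reduce to setting mapred.num.cpus once, plus 
mapred.tasktracker.tasks.maximum on each node.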

Doug
