hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "HowManyMapsAndReduces" by AmarKamat
Date Wed, 06 Aug 2008 12:32:35 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by AmarKamat:
http://wiki.apache.org/hadoop/HowManyMapsAndReduces

------------------------------------------------------------------------------
  The number of maps is usually driven by the number of DFS blocks in the input files, which
leads people to adjust their DFS block size in order to adjust the number of maps. The right
level of parallelism for maps seems to be around 10-100 maps/node, although we have taken
it up to 300 or so for very CPU-light map tasks.
  Task setup takes a while, so it is best if each map takes at least a minute to execute.
  
- Actually controlling the number of maps is subtle. The mapred.map.tasks parameter is just
a hint to the !InputFormat for the number of maps. The default !InputFormat behavior is to
split the total number of bytes into the right number of fragments. However, the DFS block
size of the input files is treated as an upper bound for input splits. A lower bound on the
split size can be set via mapred.min.split.size. Thus, if you expect 10TB of input data and
have 128MB DFS blocks, you'll end up with 82k maps, unless your mapred.map.tasks is even larger.
+ Actually controlling the number of maps is subtle. The mapred.map.tasks parameter is just
a hint to the !InputFormat for the number of maps. The default !InputFormat behavior is to
split the total number of bytes into the right number of fragments. However, in the default
case the DFS block size of the input files is treated as an upper bound for input splits.
A lower bound on the split size can be set via mapred.min.split.size. Thus, if you expect
10TB of input data and have 128MB DFS blocks, you'll end up with 82k maps, unless your mapred.map.tasks
is even larger. Ultimately the [http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html
InputFormat] determines the number of maps.
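  As a rough illustration of the arithmetic above, here is a minimal sketch (not part of the
wiki page itself; it assumes the classic org.apache.hadoop.mapred API, and the driver class
name is hypothetical):
{{{
// Sketch: default split sizing for the 10TB-input, 128MB-block example.
import org.apache.hadoop.mapred.JobConf;

public class SplitSizingSketch {
    public static void main(String[] args) {
        JobConf conf = new JobConf(SplitSizingSketch.class);

        // Set the lower bound on the split size, in bytes (here 64 MB);
        // in the default case the DFS block size of the input files
        // acts as the upper bound on a split.
        conf.setLong("mapred.min.split.size", 64L * 1024 * 1024);

        // 10 TB of input at one split per 128 MB block comes to ~82k maps:
        long inputBytes = 10L * 1024 * 1024 * 1024 * 1024; // 10 TB
        long blockBytes = 128L * 1024 * 1024;              // 128 MB
        System.out.println(inputBytes / blockBytes + " maps"); // prints 81920
    }
}
}}}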
  
  The number of map tasks can also be increased manually using the JobConf's conf.setNumMapTasks(int
num). This can raise the number of map tasks above what splitting alone would produce, but
it will not set the number below what Hadoop determines by splitting the input data.
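  For instance (again only a sketch; the job class name here is hypothetical):
{{{
import org.apache.hadoop.mapred.JobConf;

public class MapCountSketch {
    public static void main(String[] args) {
        JobConf conf = new JobConf(MapCountSketch.class);
        // Request more map tasks than splitting alone would produce.
        // The framework may honor the increase, but it will never run
        // fewer maps than there are input splits.
        conf.setNumMapTasks(200000);
    }
}
}}}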
  
