hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "HowManyMapsAndReduces" by SomeOtherAccount
Date Thu, 17 Jul 2014 16:28:58 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "HowManyMapsAndReduces" page has been changed by SomeOtherAccount:
https://wiki.apache.org/hadoop/HowManyMapsAndReduces?action=diff&rev1=7&rev2=8

  
  == Number of Reduces ==
  
- The right number of reduces seems to be 0.95 or 1.75 * (nodes * mapred.tasktracker.tasks.maximum). At 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. At 1.75 the faster nodes will finish their first round of reduces and launch a second round of reduces, doing a much better job of load balancing.
+ The ideal number of reducers is the value that gets them closest to:
+ 
+ * A multiple of the block size
+ * A task time between 5 and 15 minutes
+ * The fewest output files possible
+ 
+ Anything other than that means there is a good chance your reducer count is less than great.
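
As a rough sketch of the first heuristic, the back-of-the-envelope calculation below picks a reduce count whose per-reducer output lands on a small multiple of the block size; the output size, block size, and multiple are made-up example values, not Hadoop defaults.

{{{
public class ReducerEstimate {
  public static void main(String[] args) {
    long expectedOutputBytes = 100L * 1024 * 1024 * 1024; // assume ~100 GB of total reduce output
    long blockSize = 128L * 1024 * 1024;                  // assume a 128 MB HDFS block size
    long targetPerReducer = 4 * blockSize;                // aim for roughly 4 blocks per output file

    long reduces = Math.max(1, expectedOutputBytes / targetPerReducer);
    System.out.println(reduces + " reduces");             // ~200 for these numbers
  }
}
}}}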
 There is a tremendous tendency for users to use a REALLY high value ("More parallelism means
faster!") or a REALLY low value ("I don't want to blow my namespace quota!").  Both are equally
dangerous, resulting in one or more of:
+ 
+ * Terrible performance on the next phase of the workflow
+ * Terrible performance due to the shuffle
+ * Terrible overall performance because you've overloaded the namenode with objects that are ultimately useless
+ * Destroying disk IO for no really sane reason
+ * Lots of network transfers due to dealing with crazy amounts of CFIF/MFIF (CombineFileInputFormat/MultiFileInputFormat) work
+ 
+ Now, there are always exceptions and special cases. One particular special case is that if following that advice makes the next step in the workflow do ridiculous things, then we likely need to 'be an exception' to the above general rules of thumb.
  
  Currently the number of reduces is limited to roughly 1000 by the buffer size for the output files (io.buffer.size * 2 * numReduces << heapSize). This will be fixed at some point, but until it is, it provides a pretty firm upper bound.
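
As a rough, hypothetical illustration of that constraint (the io.buffer.size and heap size below are assumed values, not Hadoop defaults), the arithmetic looks like this:

{{{
public class ReduceBufferBound {
  public static void main(String[] args) {
    long ioBufferSize = 128 * 1024;          // io.buffer.size, assumed to be 128 KB
    int numReduces = 1000;
    long heapSize = 512L * 1024 * 1024;      // assumed 512 MB task heap

    long bufferBytes = ioBufferSize * 2 * numReduces; // ~250 MB of output buffers
    // The constraint wants bufferBytes << heapSize; at ~250 MB against a 512 MB heap
    // that no longer holds, which is why ~1000 reduces is about the ceiling here.
    System.out.println(bufferBytes + " bytes of buffer vs " + heapSize + " bytes of heap");
  }
}
}}}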
  
- The number of reduces also controls the number of output files in the output directory,
but usually that is not important because the next map/reduce step will split them into even
smaller splits for the maps.
- 
  The number of reduce tasks can also be increased in the same way as the map tasks, via JobConf's
conf.setNumReduceTasks(int num).
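
A minimal sketch of that call with the old org.apache.hadoop.mapred API; the job class, paths, and task counts here are made up for illustration:

{{{
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ReduceCountExample {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ReduceCountExample.class);
    conf.setJobName("reduce-count-example");                         // made-up job name

    FileInputFormat.setInputPaths(conf, new Path("/example/input")); // made-up paths
    FileOutputFormat.setOutputPath(conf, new Path("/example/output"));

    conf.setNumMapTasks(100);     // only a hint; the InputFormat decides the real number of maps
    conf.setNumReduceTasks(32);   // honored exactly; also sets the number of output files

    JobClient.runJob(conf);
  }
}
}}}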
  
