hadoop-common-user mailing list archives

From Shi Yu <sh...@uchicago.edu>
Subject conf.setCombinerClass in Map/Reduce
Date Tue, 05 Oct 2010 23:32:01 GMT
Hi,

I am still confused about the effect of using a Combiner in Hadoop 
Map/Reduce. The performance tips 
(http://www.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/) 
suggest writing a combiner to do initial aggregation before the data 
hits the reducer, for a performance advantage. But in most of the 
example code and books I have seen, the same reduce class is set as 
both the reducer and the combiner, for example:

conf.setCombinerClass(Reduce.class);

conf.setReducerClass(Reduce.class);
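
For concreteness, this is the kind of reduce class I mean: a minimal 
word-count-style sketch against the old mapred API of 0.19 (the class 
and type names here are my own, not taken from any particular book):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  // Summing is associative and commutative, so running this same
  // class as a combiner over partial map output does not change
  // the final result; it only shrinks the data sent to the reducer.
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}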


I don't know the specific reason for doing it this way. In my own code, 
based on Hadoop 0.19.2 and using MultipleOutputs, if I set the combiner 
class to the reduce class, the output files are named xxx-m-00000, and 
if there are multiple input paths, the number of output files is the 
same as the number of input paths. conf.setNumReduceTasks(int) then has 
no effect on the number of output files. I wonder where the 
reducer-generated outputs are in this case, because I cannot see them. 
To see the reducer output, I have to remove the combiner class:

//conf.setCombinerClass(Reduce.class);

conf.setReducerClass(Reduce.class);


and then the output files are named xxx-r-00000, and I can control the 
number of output files using conf.setNumReduceTasks(int).
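
In case it helps, this is roughly how my job is wired up, using 
org.apache.hadoop.mapred.lib.MultipleOutputs (the class names and the 
named output "text" are placeholders for my actual code):

JobConf conf = new JobConf(MyJob.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);  // removing this line is what changes the behavior
conf.setReducerClass(Reduce.class);
conf.setNumReduceTasks(4);            // seems to be ignored while the combiner is set

// The reduce class writes through a named output:
MultipleOutputs.addNamedOutput(conf, "text",
    TextOutputFormat.class, Text.class, IntWritable.class);

and inside the reduce class:

private MultipleOutputs mos;

public void configure(JobConf job) {
  mos = new MultipleOutputs(job);
}

public void reduce(Text key, Iterator<IntWritable> values,
                   OutputCollector<Text, IntWritable> output,
                   Reporter reporter) throws IOException {
  int sum = 0;
  while (values.hasNext()) {
    sum += values.next().get();
  }
  // Writes go through MultipleOutputs rather than the plain collector.
  mos.getCollector("text", reporter).collect(key, new IntWritable(sum));
}

public void close() throws IOException {
  mos.close();
}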

So my questions are: what is the main advantage of setting the combiner 
class and the reducer class to the same reduce class? How do I merge 
the output files in this case? And where can I find a real example that 
uses different Combiner and Reducer classes to improve Map/Reduce 
performance?
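
To make that last question concrete, here is the sort of thing I 
imagine (a sketch I made up, using the same imports as the sketch 
above): the combiner does only a partial sum, while the reducer does 
the final sum plus a filter that would be wrong to apply to partial 
data, so the two cannot be the same class.

// Combiner: partial sums only. It must not filter, because it
// sees only a subset of the values for each key.
public class SumCombiner extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

// Reducer: final sum plus a filter that is only valid once all
// values for the key have been gathered.
public class FilteringReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    if (sum >= 10) {  // arbitrary threshold, just for the sketch
      output.collect(key, new IntWritable(sum));
    }
  }
}

conf.setCombinerClass(SumCombiner.class);
conf.setReducerClass(FilteringReducer.class);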

Thanks.

Shi
