hadoop-common-user mailing list archives

From Antonio Piccolboni <anto...@piccolboni.info>
Subject Re: conf.setCombinerClass in Map/Reduce
Date Wed, 06 Oct 2010 04:48:27 GMT
On Tue, Oct 5, 2010 at 4:32 PM, Shi Yu <shiyu@uchicago.edu> wrote:

> Hi,
> I am still confused about the effect of using a Combiner in Hadoop
> Map/Reduce. The performance tips (
> http://www.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/)
> suggest writing a combiner to do initial aggregation before the data
> hits the reducer, for a performance advantage. But in most of the example
> code or books I have seen, the same reduce class is set as both the reducer
> and the combiner, such as
> conf.setCombinerClass(Reduce.class);
> conf.setReducerClass(Reduce.class);
> I don't know the specific reason for doing it like this. In my own code
> based on Hadoop 0.19.2, if I set the combiner class to the reduce class
> while using MultipleOutputs, the output files are named xxx-m-00000, and if
> there are multiple input paths, the number of output files equals the number
> of input paths. conf.setNumReduceTasks(int) no longer controls the number of
> output files. I wonder where the reducer-generated outputs are in this case,
> because I cannot see them. To see the reducer output, I have to remove the
> combiner class
> //conf.setCombinerClass(Reduce.class);
> conf.setReducerClass(Reduce.class);
> and then the output files are named xxx-r-00000, and I can control the
> number of output files using conf.setNumReduceTasks(int).
> So my question is: what is the main advantage of setting the combiner class
> and the reducer class to the same reduce class?

When the calculation performed by the reducer is commutative and
associative, with a combiner you get more work done before the shuffle, less
sorting and shuffling, and less work in the reducer. In the word count app,
the mapper emits <"the", 1> a billion times, but with a combiner equal to
the reducer only <"the", 10^9> has to travel to the reducer. If you couldn't
use a combiner, not only would the shuffle phase be as heavy as if you had a
billion distinct words, but the poor reducer that gets the "the" key would
also be very slow, so you would have to go through multiple map/reduce
phases to aggregate the data anyway.
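
For reference, here is a minimal sketch of that setup in the old
0.19-style org.apache.hadoop.mapred API. It is my own illustration, not
code from this thread: the Map and Reduce classes are the ones from the
stock WordCount example, and the driver class name is hypothetical.
Because summing counts is commutative and associative, the same Reduce
class is safe to register as both combiner and reducer.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountDriver.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);        // stock WordCount mapper: emits <word, 1>
    conf.setCombinerClass(Reduce.class);   // pre-sums counts on the map side
    conf.setReducerClass(Reduce.class);    // same summation again after the shuffle

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}

The combiner may run zero, one, or several times on the map output, so this
only works because re-summing partial sums gives the same final counts.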

> How do I merge the output files in this case?

I am not sure what you mean, but from your point of view there is no
difference: the output is the same whether or not a combiner runs.

> And where can I find a real example that uses different Combiner/Reducer
> classes to improve map/reduce performance?

If you want to compute an average, the combiner only does partial sums (of
the values and of the counts), while the reducer does the same sums followed
by the final division. It would not be OK to divide in the combiner, because
the combiner can run any number of times on partial data. See also

The interfaces of the reducer and the combiner are the same, but they need
not be the same class.
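
To make that concrete, here is a minimal sketch of an averaging job in the
old 0.19-style mapred API. It is my own illustration, not code from this
thread; the class names and the "sum,count" Text encoding are hypothetical.
The mapper is assumed to emit <key, "value,1">; the combiner only pre-sums,
and only the reducer divides.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Combiner: folds many <key, "value,1"> records into one
// <key, "partialSum,partialCount"> record, which is still a valid
// map-output record, so it can safely run any number of times.
public class AvgCombiner extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {
  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    double sum = 0;
    long count = 0;
    while (values.hasNext()) {
      String[] parts = values.next().toString().split(",");
      sum += Double.parseDouble(parts[0]);
      count += Long.parseLong(parts[1]);
    }
    out.collect(key, new Text(sum + "," + count));  // no division here
  }
}

// Reducer: same summation, plus the final division, which must happen
// exactly once per key.
class AvgReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, DoubleWritable> {
  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, DoubleWritable> out, Reporter reporter)
      throws IOException {
    double sum = 0;
    long count = 0;
    while (values.hasNext()) {
      String[] parts = values.next().toString().split(",");
      sum += Double.parseDouble(parts[0]);
      count += Long.parseLong(parts[1]);
    }
    out.collect(key, new DoubleWritable(sum / count));
  }
}

In the driver you would register them separately, e.g.
conf.setCombinerClass(AvgCombiner.class) and
conf.setReducerClass(AvgReducer.class), and also call
conf.setMapOutputValueClass(Text.class), since the map/combiner output value
type (Text) differs from the final output value type (DoubleWritable).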


> Thanks.
> Shi
