hadoop-common-user mailing list archives

From Shi Yu <sh...@uchicago.edu>
Subject Re: conf.setCombinerClass in Map/Reduce
Date Wed, 06 Oct 2010 05:03:53 GMT
Hi, thanks for the answer, Antonio.

I have found one of the main problems. I was using MultipleOutputs in
the Reduce class, so when that class is set as both the Combiner and the
Reducer, the Combiner does not pass a normal data flow on to the Reducer.
The data flow therefore stops at the Combiner and no Reducer actually
does any work. To solve this, I have to write to both outputs:

OutputCollector collector =
    multipleOutputs.getCollector("stringlabel", keyText, reporter);
collector.collect(keyText, value);
output.collect(key, value);

The collector generates the separate output files, while output makes
sure the data still flows on to the Reducer. After this change, both the
Combiner and the Reducer work.
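
To make the data flow concrete, here is a minimal plain-Java sketch of what I believe was happening. It is only a model, not the Hadoop API: the Collector class, the combine method, and the alsoWriteMain flag are all names made up for illustration. The side collector stands in for the one returned by MultipleOutputs, and the other stands in for the regular output that feeds the Reducer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of the data flow, NOT the Hadoop API.
// Every class, method, and parameter name here is hypothetical.
final class CombinerFlowSketch {

    // Stand-in for an output collector: it only remembers what it was given.
    static final class Collector {
        final List<String> records = new ArrayList<>();

        void collect(String key, int value) {
            records.add(key + "=" + value);
        }
    }

    // One combiner call for one key. `sideFile` plays the role of the collector
    // obtained from MultipleOutputs; `output` plays the role of the regular
    // collector, the only one whose records travel on to the reducer.
    static void combine(String key, List<Integer> values,
                        Collector sideFile, Collector output,
                        boolean alsoWriteMain) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        sideFile.collect(key, sum);       // goes to the separate output file
        if (alsoWriteMain) {
            output.collect(key, sum);     // goes on towards the reducer
        }
    }

    public static void main(String[] args) {
        List<Integer> ones = Arrays.asList(1, 1, 1);

        // Before the fix: only the side collector is written.
        Collector side = new Collector();
        Collector toReducer = new Collector();
        combine("the", ones, side, toReducer, false);
        System.out.println(side.records + " " + toReducer.records);

        // After the fix: both collectors are written.
        side = new Collector();
        toReducer = new Collector();
        combine("the", ones, side, toReducer, true);
        System.out.println(side.records + " " + toReducer.records);
    }
}
```

In this model the first call leaves the reducer-bound collector empty, which matches what I saw: side files were written but the Reducer had nothing to process.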

The remaining question: if I want to use both a Combiner and a Reducer,
must the input and output of the Reduce class be the same <K2,V2>? If
not, how do I do it? The use case seems very limited here. For example,
what if the Reducer class is a bit more complicated, taking <K2,V2> as
input and emitting <K3,V3> as output?
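
As far as I understand it, a Combiner has to consume and emit the same <K2,V2> types, because its output is fed back into the shuffle, while the Reducer is free to emit a different <K3,V3>. So one way out seems to be two separate classes. Below is a minimal plain-Java sketch of that idea using the average example from Antonio's reply. It does not use the Hadoop API, and SumCount, combine, and reduce are hypothetical names chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model, NOT the Hadoop API. All names are hypothetical.
final class AverageSketch {

    // Intermediate value type (V2): a partial sum together with its count.
    static final class SumCount {
        final double sum;
        final long count;

        SumCount(double sum, long count) {
            this.sum = sum;
            this.count = count;
        }
    }

    // Combiner: V2 values in, one V2 value out. It only adds, never divides,
    // so it is safe to apply zero, one, or several times.
    static SumCount combine(List<SumCount> values) {
        double sum = 0.0;
        long count = 0;
        for (SumCount v : values) {
            sum += v.sum;
            count += v.count;
        }
        return new SumCount(sum, count);
    }

    // Reducer: V2 values in, V3 (a double) out. The division happens only here.
    static double reduce(List<SumCount> values) {
        SumCount total = combine(values);
        return total.sum / total.count;
    }

    public static void main(String[] args) {
        // Two map tasks emit (value, 1) pairs for the same key.
        List<SumCount> mapTask1 = Arrays.asList(
                new SumCount(1, 1), new SumCount(2, 1), new SumCount(3, 1));
        List<SumCount> mapTask2 = Arrays.asList(new SumCount(10, 1));

        // Each map task's output is combined locally before the shuffle.
        List<SumCount> shuffled = new ArrayList<>();
        shuffled.add(combine(mapTask1));
        shuffled.add(combine(mapTask2));

        System.out.println(reduce(shuffled)); // prints 4.0
    }
}
```

Dividing inside the combiner would instead average the two partial averages (2 and 10) and give 6 rather than 4, which is why the division has to wait for the Reducer.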

Thanks again.

Shi


On 2010-10-5 23:48, Antonio Piccolboni wrote:
> On Tue, Oct 5, 2010 at 4:32 PM, Shi Yu <shiyu@uchicago.edu> wrote:
>
>    
>> Hi,
>>
>> I am still confused about the effect of using a Combiner in Hadoop
>> Map/Reduce. The performance tips (
>> http://www.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/)
>> suggest writing a combiner to do initial aggregation before the data
>> hits the reducer, for a performance advantage. But in most of the example
>> code and books I have seen, the same reduce class is set as both the
>> reducer and the combiner, such as
>>
>> conf.setCombinerClass(Reduce.class);
>>
>> conf.setReducerClass(Reduce.class);
>>
>>
>> I don't know the specific reason for doing it this way. In my own code,
>> based on Hadoop 0.19.2, if I set the combiner class to the reduce class,
>> which uses MultipleOutputs, the output files are named xxx-m-00000. And if
>> there are multiple input paths, the number of output files equals the
>> number of input paths. conf.setNumReduceTasks(int) then has no effect on
>> the number of output files. I wonder where the reducer-generated outputs
>> are in this case, because I cannot see them. To see the reducer output,
>> I have to remove the combiner class
>>
>> //conf.setCombinerClass(Reduce.class);
>>
>> conf.setReducerClass(Reduce.class);
>>
>>
>> and then I get output files named xxx-r-00000. I can then control
>> the number of output files using conf.setNumReduceTasks(int).
>>
>> So my question is: what is the main advantage of setting the combiner
>> class and the reducer class to the same reduce class?
>>      
>
> When the calculation performed by the reducer is commutative and
> associative, with a combiner you get more work done before the shuffle, less
> sorting and shuffling, and less work in the reducer. As in the word count
> app: the mapper emits <"the", 1> a billion times, but with a combiner equal
> to the reducer only <"the", 10^9> has to travel to the reducer. If you
> couldn't use the combiner, not only would the shuffle phase be as heavy as
> if you had a billion distinct words, but the poor reducer that gets the
> "the" key would also be very slow. So you would have to go through multiple
> mapreduce phases to aggregate the data anyway.
>
>
>
>    
>> How to merge the output files in this case?
>>      
>
> While I am not sure what you mean, there is no difference to you. The output
> is the same.
>
>
>
>    
>> And where to find any real example using different Combiner/Reducer classes
>> to improve the map/reduce performance?
>>
>>      
> If you want to compute an average, the combiner needs to do only sums; the
> reducer does the sums and the final division. It would not be OK to divide
> in the combiner. See also
> http://philippeadjiman.com/blog/2010/01/14/hadoop-tutorial-series-issue-4-to-use-or-not-to-use-a-combiner/
>
> The interfaces of the reducer and the combiner are the same, but they need
> not be the same class.
>
>
> Antonio
>
>
>
>    
>> Thanks.
>>
>> Shi
>>


-- 
Postdoctoral Scholar
Institute for Genomics and Systems Biology
Department of Medicine, the University of Chicago
Knapp Center for Biomedical Discovery
900 E. 57th St. Room 10148
Chicago, IL 60637, US
Tel: 773-702-6799

