hadoop-common-user mailing list archives

From Raj V <rajv...@yahoo.com>
Subject Re: When to use a combiner?
Date Tue, 24 Jan 2012 22:34:35 GMT
Just to add to Sameer's response: you cannot use a combiner if you are
computing the average temperature. The combiner running on each mapper
would produce the average of that mapper's output, and the reducer would
then average the combiner outputs, which gives you the average of the
averages rather than the average of the whole data set.
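A quick worked example of why that fails: if one mapper's output averages
to 20 over three readings (10, 20, 30) and another's to 40 over a single
reading (40), the average of the averages is 30, but the true average of
all four readings is 25.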

You can use a combiner if your reduce function R satisfies

R(S) = R(R(s1), R(s2), ..., R(sn))

where S is the whole set and s1, s2, ..., sn are an arbitrary partition
of the set S.
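
Max, min and sum all have this property; mean does not. When the property
holds, the reducer class itself can be registered as the combiner. A
minimal sketch in the new MapReduce API (the class name is mine, for
illustration only):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // R = max. Since max(S) = max(max(s1), ..., max(sn)), this class is
    // safe to use as both the combiner and the reducer.
    public class MaxTemperatureReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

      @Override
      protected void reduce(Text year, Iterable<IntWritable> temps,
          Context context) throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable t : temps) {
          max = Math.max(max, t.get()); // order and grouping are irrelevant
        }
        context.write(year, new IntWritable(max));
      }
    }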


> From: Sameer Farooqui <sameer@hortonworks.com>
>To: common-user@hadoop.apache.org 
>Sent: Tuesday, January 24, 2012 12:22 PM
>Subject: Re: When to use a combiner?
>Hi Steve,
>Yeah, you're right in your suspicions that a combiner may not be useful
>in your use case. It's mainly used to reduce network traffic between the
>mappers and the reducers. Hadoop may apply the combiner zero, one or
>multiple times to the intermediate output from the mapper, so it's hard
>to accurately predict the CPU impact a combiner will have. The reduction
>in network packets is a lot easier to predict and actually see.
>From Chuck Lam's 'Hadoop in Action': "A combiner doesn't necessarily
>improve performance. You should monitor the job's behavior to see if the
>number of records outputted by the combiner is meaningfully less than
>the number of records going in. The reduction must justify the extra
>execution time of running a combiner. You can easily check this through
>the JobTracker's Web UI."
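>The same check can also be done programmatically from the job's counters
>once it finishes. A sketch, assuming the newer org.apache.hadoop.mapreduce
>API and its standard TaskCounter enum (adjust for your Hadoop version):
>
>    import org.apache.hadoop.mapreduce.Job;
>    import org.apache.hadoop.mapreduce.TaskCounter;
>
>    // Compare combine input vs. output record counts for a completed job.
>    public class CombinerCheck {
>      public static void printReduction(Job job) throws Exception {
>        long in = job.getCounters()
>            .findCounter(TaskCounter.COMBINE_INPUT_RECORDS).getValue();
>        long out = job.getCounters()
>            .findCounter(TaskCounter.COMBINE_OUTPUT_RECORDS).getValue();
>        System.out.printf("combine: %d in, %d out (%.0f%% fewer records)%n",
>            in, out, in == 0 ? 0.0 : 100.0 * (in - out) / in);
>      }
>    }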
>One thing to point out: don't just assume the combiner is ineffective
>because it isn't reducing the number of unique keys emitted from the map
>side. It really depends on your specific use case for the combiner and
>the nature of the MapReduce job. For example, imagine your map tasks
>find the maximum temperature for a given year (example from 'Hadoop: The
>Definitive Guide'), like so:
>Node 1's Map output:
>(1950, 20)
>(1950, 10)
>(1950, 40)
>Node 2's Map output:
>(1950, 0)
>(1950, 15)
>The reduce function would get this input after the shuffle phase:
>(1950, [0, 10, 15, 20, 40])
>and the reduce function would output:
>(1950, 40)
>But if you used a combiner, the reduce function would have gotten
>smaller input to work with after the shuffle phase:
>(1950, [40, 15]) 
>and the output from Reduce would be the same.
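>Wiring that up is a one-line change in the driver. A minimal sketch,
>with illustrative class names in the spirit of the book's example:
>
>    import org.apache.hadoop.conf.Configuration;
>    import org.apache.hadoop.mapreduce.Job;
>
>    // Driver sketch: reuse the max-temperature reducer as the combiner,
>    // since taking a per-mapper max before the shuffle is safe.
>    public class MaxTemperatureDriver {
>      public static Job configure(Configuration conf) throws Exception {
>        Job job = Job.getInstance(conf, "max temperature");
>        job.setMapperClass(MaxTemperatureMapper.class);
>        job.setCombinerClass(MaxTemperatureReducer.class); // same as reducer
>        job.setReducerClass(MaxTemperatureReducer.class);
>        return job;
>      }
>    }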
>There are specific use cases like the one above where a combiner
>delivers big performance gains, but it shouldn't be used 100% of the
>time by default.
>Both of the books I mentioned are excellent, with tons of real-world
>tips, so I highly recommend them.
>Sameer Farooqui
>Systems Architect / HortonWorks
>Steve Lewis wrote on January 24, 2012, 9:33 AM:
>>While working a sample problem I used a combiner, and I noticed that
>>the combiner output records were 90% of the combiner input records;
>>looking at the data, I found relatively few duplicated keys. This
>>raises the question of what fraction of duplicate keys makes it
>>reasonable to use a combiner. If every key is unique, I presume that
>>using a combiner will waste time and resources, especially if the data
>>is large. What fraction of duplicated keys is needed to justify a
>>combiner?