From Raj V <rajv...@yahoo.com>
Subject Re: When to use a combiner?
Date Wed, 25 Jan 2012 15:46:57 GMT
Touché!

Raj

> From: Robert Evans <evans@yahoo-inc.com>

>Sent: Wednesday, January 25, 2012 7:36 AM
>Subject: Re: When to use a combiner?
>You can use a combiner for average. You just have to write a separate combiner from the reducer:
>class myCombiner {
>    // Each value is a (sum, count) pair
>    void reduce(Key key, Iterable<Pair<Long, Long>> values, Context context) {
>        long sum = 0;
>        long count = 0;
>        for (Pair<Long, Long> value : values) {
>            sum += value.first;
>            count += value.second;
>        }
>        // Emit a partial (sum, count) pair, not an average
>        context.write(key, new Pair<Long, Long>(sum, count));
>    }
>}
>class myReducer {
>    // Each value is a (sum, count) pair
>    void reduce(Key key, Iterable<Pair<Long, Long>> values, Context context) {
>        long sum = 0;
>        long count = 0;
>        for (Pair<Long, Long> value : values) {
>            sum += value.first;
>            count += value.second;
>        }
>        // Only the reducer divides, so the result is the true average
>        context.write(key, ((double)sum)/count);
>    }
>}
>--Bobby Evans
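
The sum/count approach above can be tried out without a cluster. What follows is only a minimal sketch in plain Java, not Hadoop code: the class name AverageCombinerSketch, the SumCount type (a stand-in for Pair<Long, Long>), and the map/combine/reduce helpers are all hypothetical names made up for illustration. The readings are borrowed from the temperature example quoted further down the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java simulation of the idea; no Hadoop classes are used.
class AverageCombinerSketch {

    // Hypothetical stand-in for the Pair<Long, Long> in the pseudocode.
    static final class SumCount {
        final long sum;
        final long count;
        SumCount(long sum, long count) { this.sum = sum; this.count = count; }
    }

    // Map side: each raw reading becomes a (value, 1) pair.
    static List<SumCount> map(List<Long> readings) {
        List<SumCount> out = new ArrayList<>();
        for (long r : readings) out.add(new SumCount(r, 1));
        return out;
    }

    // Combiner: folds many pairs into one pair. Input and output types match,
    // so running it zero, one, or several times does not change the result.
    static SumCount combine(List<SumCount> pairs) {
        long sum = 0, count = 0;
        for (SumCount p : pairs) { sum += p.sum; count += p.count; }
        return new SumCount(sum, count);
    }

    // Reducer: the same fold, but it emits the final average.
    static double reduce(List<SumCount> pairs) {
        SumCount total = combine(pairs);
        return ((double) total.sum) / total.count;
    }

    public static void main(String[] args) {
        List<SumCount> node1 = map(Arrays.asList(20L, 10L, 40L));
        List<SumCount> node2 = map(Arrays.asList(0L, 15L));

        // No combiner: the reducer sees all five pairs.
        List<SumCount> all = new ArrayList<>(node1);
        all.addAll(node2);
        System.out.println(reduce(all));                                            // 17.0

        // One combiner pass per node: the reducer sees only two pairs.
        System.out.println(reduce(Arrays.asList(combine(node1), combine(node2))));  // 17.0
    }
}
```

Both paths give the same average because the division happens only once, in the reducer; the combiner just ships partial sums and counts.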
>On 1/24/12 4:34 PM, "Raj V" <rajvish@yahoo.com> wrote:
>>Just to add to Sameer's response - you cannot use a combiner in case you are finding the
>>average temperature. The combiner running on each mapper will produce the average for that
>>mapper's output and the reducer will find the average of the combiner outputs, which in this
>>case will be the average of the averages.
>>You can use a combiner if your reducer function R is like this:
>>
>>R(S) = R(R(s1), R(s2), ..., R(sn)), where S is the whole set and s1, s2, ..., sn are some
>>arbitrary partition of the set S.
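
The condition R(S) = R(R(s1), R(s2), ..., R(sn)) can be checked by hand for a given reduce function. The following is a rough sketch in plain Java, assuming a two-way partition of five sample values; the class name CombinerPropertySketch and its max/mean helpers are hypothetical and exist only for this illustration.

```java
import java.util.Arrays;
import java.util.List;

// Checks R(S) == R(R(s1), R(s2)) for two candidate reduce functions.
class CombinerPropertySketch {

    static double max(List<Double> values) {
        double best = Double.NEGATIVE_INFINITY;
        for (double v : values) best = Math.max(best, v);
        return best;
    }

    static double mean(List<Double> values) {
        double sum = 0;
        for (double v : values) sum += v;
        return sum / values.size();
    }

    public static void main(String[] args) {
        // S split into two arbitrary partitions s1 and s2.
        List<Double> s1 = Arrays.asList(20.0, 10.0, 40.0);
        List<Double> s2 = Arrays.asList(0.0, 15.0);
        List<Double> s = Arrays.asList(20.0, 10.0, 40.0, 0.0, 15.0);

        // max satisfies the property, so the reducer can double as the combiner.
        System.out.println(max(s) == max(Arrays.asList(max(s1), max(s2))));      // true

        // mean does not: the mean of per-partition means weights s1 and s2
        // equally even though they hold different numbers of values.
        System.out.println(mean(s) == mean(Arrays.asList(mean(s1), mean(s2))));  // false
    }
}
```

A function that fails this check, such as the mean, needs a separate combiner that emits an intermediate form rather than the final answer.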
>>> From: Sameer Farooqui <sameer@hortonworks.com>
>>> Sent: Tuesday, January 24, 2012 12:22 PM
>>> Subject: Re: When to use a combiner?
>>>Hi Steve,
>>>
>>>Yeah, you're right in your suspicions that a combiner may not be useful in your
>>>use case. It's mainly used to reduce network traffic between the mappers and the reducers.
>>>Hadoop may apply the combiner zero, one or multiple times to the intermediate output from
>>>the mapper, so it's hard to accurately predict the CPU impact a combiner will have. The reduction
>>>in network packets is a lot easier to predict and actually see.
>>>From Chuck Lam's 'Hadoop in Action': "A combiner doesn't necessarily improve
>>>performance. You should monitor the job's behavior to see if the number of records outputted
>>>by the combiner is meaningfully less than the number of records going in. The reduction must
>>>justify the extra execution time of running a combiner. You can easily check this through
>>>the JobTracker's Web UI."
>>>One thing to point out: don't assume a combiner is ineffective just because it doesn't
>>>reduce the number of unique keys emitted from the map side. It really depends on your
>>>specific use case for the combiner and the nature of the MapReduce job. For example, imagine
>>>a job that finds the maximum temperature per year ([...] Guide'), like so:
>>>
>>>Node 1's Map output:
>>>(1950, 20)
>>>(1950, 10)
>>>(1950, 40)
>>>Node 2's Map output:
>>>(1950, 0)
>>>(1950, 15)
>>>The reduce function would get this input after the shuffle phase:
>>>(1950, [0, 10, 15, 20, 40])
>>>and the reduce function would output:
>>>(1950, 40)
>>>But if you used a combiner, the reduce function would have gotten smaller input
>>>to work with after the shuffle phase:
>>>(1950, [40, 15])
>>>and the output from Reduce would be the same.
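
The effect described above can be reproduced with a small plain-Java simulation. This is only a sketch under stated assumptions: it models the shuffle as simple list concatenation, handles the single key 1950, and runs the combiner exactly once per node, which Hadoop does not guarantee. The class name MaxTemperatureCombinerSketch and its helper methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Simulates the key-1950 example: two map outputs, with and without a combiner.
class MaxTemperatureCombinerSketch {

    // For max, the combiner and the reducer share the same logic.
    static int max(List<Integer> temps) {
        return Collections.max(temps);
    }

    // Values the reducer receives when no combiner runs.
    static List<Integer> shuffleWithoutCombiner(List<List<Integer>> mapOutputs) {
        List<Integer> merged = new ArrayList<>();
        for (List<Integer> out : mapOutputs) merged.addAll(out);
        return merged;
    }

    // Values the reducer receives when the combiner runs once per map output.
    static List<Integer> shuffleWithCombiner(List<List<Integer>> mapOutputs) {
        List<Integer> merged = new ArrayList<>();
        for (List<Integer> out : mapOutputs) merged.add(max(out));
        return merged;
    }

    public static void main(String[] args) {
        List<List<Integer>> mapOutputs = Arrays.asList(
            Arrays.asList(20, 10, 40),   // node 1, key 1950
            Arrays.asList(0, 15));       // node 2, key 1950

        List<Integer> plain = shuffleWithoutCombiner(mapOutputs);
        List<Integer> combined = shuffleWithCombiner(mapOutputs);

        System.out.println(plain.size() + " values shuffled, max " + max(plain));       // 5 values shuffled, max 40
        System.out.println(combined.size() + " values shuffled, max " + max(combined)); // 2 values shuffled, max 40
    }
}
```

The number of unique keys is the same in both runs (one); what shrinks is the number of values shipped per key.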
>>>
>>>There are specific use cases like the one above where a combiner yields magical
>>>performance gains, but it shouldn't be used by default 100% of the time.
>>>Both of the books I mentioned are excellent with tons of real-world tips, so I
>>>highly recommend them.
