hadoop-common-user mailing list archives

From Chris Douglas <chri...@yahoo-inc.com>
Subject Re: Combiner run specification and questions
Date Tue, 06 Jan 2009 03:22:51 GMT
The combiner, if defined, will run zero or more times on records  
emitted from the map, before being fed to the reduce. It is run when  
the collector spills in the map and in some merge cases. If the  
combiner transforms the key, it is illegal to change its type, the  
partition to which it is assigned, or its ordering.

For example, if you emit a record (k,v) from your map and (k',v) from  
the combiner, your comparator is C(K,K) and your partitioner function  
is P(K), it must be the case that P(k) == P(k') and C(k,k') == 0. If  
either of these does not hold, the semantics to the reduce are broken.  
Clearly, if k is not transformed (as is true for most combiners), this  
holds trivially.
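
The condition above can be checked mechanically. The following is a minimal sketch in plain Java, not Hadoop code: the class name, the preservesContract helper, the case-insensitive comparator, and the hash-based partitioner are all assumptions made for illustration, standing in for a job's real C(K,K) and P(K).

```java
import java.util.Comparator;
import java.util.Locale;
import java.util.function.ToIntFunction;

public class CombinerKeyCheck {
    /**
     * Hypothetical helper: returns true when emitting combinedKey in place of
     * mapKey keeps the record in the same partition, P(k) == P(k'), and in the
     * same sort position, C(k,k') == 0.
     */
    static <K> boolean preservesContract(K mapKey, K combinedKey,
                                         Comparator<K> comparator,
                                         ToIntFunction<K> partitioner) {
        boolean samePartition =
            partitioner.applyAsInt(mapKey) == partitioner.applyAsInt(combinedKey);
        boolean sameOrder = comparator.compare(mapKey, combinedKey) == 0;
        return samePartition && sameOrder;
    }

    public static void main(String[] args) {
        final int numReduces = 4; // assumed number of reduce partitions

        // Assumed job configuration: keys compare and partition case-insensitively.
        Comparator<String> c = String.CASE_INSENSITIVE_ORDER;
        ToIntFunction<String> p =
            k -> (k.toLowerCase(Locale.ROOT).hashCode() & Integer.MAX_VALUE) % numReduces;

        // Changing only the case is legal under this comparator and partitioner.
        System.out.println(preservesContract("Apple", "apple", c, p));  // true

        // A genuinely different key breaks C(k,k') == 0.
        System.out.println(preservesContract("apple", "apples", c, p)); // false
    }
}
```

A check like this is probably most useful in a unit test for a combiner that rewrites keys; a combiner that passes k through unchanged satisfies it trivially.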

As was mentioned earlier, the purpose of the combiner is to compress  
data pulled across the network and spilled to disk. It should not  
affect the correctness or, in most cases, the output of the job. -C
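
To illustrate "zero or more times", here is a rough simulation in plain Java, again not Hadoop code. It assumes an additive job (a sum over the values for a single key); the class name, the spill lists, and the runJob helper are hypothetical. The point is only that the reduce sees the same total whether the combiner ran 0, 1, or 2 times on each spill.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CombinerRunCountSketch {
    /** Stand-in for a sum combiner: collapses one key's values into a single partial sum. */
    static List<Long> combine(List<Long> values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        List<Long> out = new ArrayList<>();
        out.add(sum);
        return out;
    }

    /** Stand-in for the reduce: sums whatever arrives, combined or not. */
    static long reduce(List<Long> values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    /** Runs the combiner `passes` times on each spill, merges the spills, then reduces. */
    static long runJob(List<List<Long>> spills, int passes) {
        List<Long> merged = new ArrayList<>();
        for (List<Long> spill : spills) {
            List<Long> current = spill;
            for (int i = 0; i < passes; i++) {
                current = combine(current);
            }
            merged.addAll(current);
        }
        return reduce(merged);
    }

    public static void main(String[] args) {
        // Two hypothetical map-side spills holding values for the same key.
        List<List<Long>> spills = Arrays.asList(
            Arrays.asList(1L, 2L, 3L),
            Arrays.asList(4L, 5L));

        for (int passes = 0; passes <= 2; passes++) {
            System.out.println(passes + " combiner pass(es): " + runJob(spills, passes));
        }
        // Each line reports 15.
    }
}
```

A reduce that only works when its input has already been combined (or only when it has not) would give different answers across these three runs, which is the failure mode to avoid.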

On Jan 2, 2009, at 9:57 AM, Saptarshi Guha wrote:

> Hello,
> I would just like to confirm: when does the Combiner run (since it
> might not be run at all, see below)? I read somewhere that it is run
> if there is at least one reduce (which in my case I can be sure of).
> I also read that the combiner is an optimization. However, it is also
> a chance for a function to transform the key/value (keeping the class
> the same, i.e. the combiner semantics are not changed) and deal with a
> smaller set (this could be done in the reducer, but the number of
> values for a key might be relatively large).
> However, I guess it would be a mistake for the reducer to expect its
> input to come from a combiner? E.g. if there are only 10 values
> corresponding to a key (as output by the mapper), will these 10 values
> go straight to the reducer or to the reducer via the combiner?
> Here I am assuming my reduce operation does not need all the values
> for a key to work (so that a combiner can be used), i.e. additive
> operations.
> Thank you
> Saptarshi
> On Sun, Nov 16, 2008 at 6:18 PM, Owen O'Malley <omalley@apache.org> wrote:
>> The Combiner may be called 0, 1, or many times on each key between the
>> mapper and reducer. Combiners are just an application-specific
>> optimization that compresses the intermediate output. They should not
>> have side effects or transform the types. Unfortunately, since there
>> isn't a separate interface for Combiners, there isn't a great place to
>> document this requirement.
>> I've just filed HADOOP-4668 to improve the documentation.
> -- 
> Saptarshi Guha - saptarshi.guha@gmail.com
