hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Saptarshi Guha <saptarshi.g...@gmail.com>
Subject Re: Combiner run specification and questions
Date Tue, 06 Jan 2009 15:23:38 GMT
I agree with the requirement that the key does not change. Of course,  
the values can change.
I am primarily worried that the combiner might not be run at all - I  
have 'successfully' integrated Hadoop and R i.e the user can provide  
map/reduce functions written in R. However, R is not great with memory  
management, and if I have N (N is huge) values for a given key K, then  
R will baulk when it comes to processing this.
Thus the combiner. The combiner will process n values for K, and  
ultimately, a few values for K in the reducer . If the combiner where  
not to run, R would collapse under the load.

1) I am guaranteed a reducer.
> The combiner, if defined, will run zero or more times on records  
> emitted from the map, before being fed to the reduce.

This zero case possibility worries me. However you mention, that it  
> collector spills in the map

I have noticed this happening - what 'spilling' mean?

Thank you

On Jan 5, 2009, at 10:22 PM, Chris Douglas wrote:

> The combiner, if defined, will run zero or more times on records  
> emitted from the map, before being fed to the reduce. It is run when  
> the collector spills in the map and in some merge cases. If the  
> combiner transforms the key, it is illegal to change its type, the  
> partition to which it is assigned, or its ordering.
> For example, if you emit a record (k,v) from your map and (k',v)  
> from the combiner, your comparator is C(K,K) and your partitioner  
> function is P(K), it must be the case that P(k) == P(k') and C(k,k')  
> == 0. If either of these does not hold, the semantics to the reduce  
> are broken. Clearly, if k is not transformed (as in true for most  
> combiners), this holds trivially.
> As was mentioned earlier, the purpose of the combiner is to compress  
> data pulled across the network and spilled to disk. It should not  
> affect the correctness or, in most cases, the output of the job. -C
> On Jan 2, 2009, at 9:57 AM, Saptarshi Guha wrote:
>> Hello,
>> I would just like to confirm, when does the Combiner run(since it
>> might not be run at all,see below). I read somewhere that it is run,
>> if there is at least one reduce (which in my case i can be sure of).
>> I also read, that the combiner is an optimization. However, it is  
>> also
>> a chance for a function to transform the key/value (keeping the class
>> the same i.e the combiner semantics are not changed) and deal with a
>> smaller set ( this could be done in the reducer but the number of
>> values for a key might be relatively large).
>> However, I guess it would be a mistake for reducer to expect its  
>> input
>> coming from a combiner? E.g if there are only 10 value corresponding
>> to a key(as outputted by the mapper), will these 10 values go  
>> straight
>> to the reducer or to the reducer via the combiner?
>> Here I am assuming my reduce operations does not need all the values
>> for a key to work(so that a combiner can be used) i.e additive
>> operations.
>> Thank you
>> Saptarshi
>> On Sun, Nov 16, 2008 at 6:18 PM, Owen O'Malley <omalley@apache.org>  
>> wrote:
>>> The Combiner may be called 0, 1, or many times on each key between  
>>> the
>>> mapper and reducer. Combiners are just an application specific  
>>> optimization
>>> that compress the intermediate output. They should not have side  
>>> effects or
>>> transform the types. Unfortunately, since there isn't a separate  
>>> interface
>>> for Combiners, there is isn't a great place to document this  
>>> requirement.
>>> I've just filed HADOOP-4668 to improve the documentation.
>> -- 
>> Saptarshi Guha - saptarshi.guha@gmail.com

Saptarshi Guha | saptarshi.guha@gmail.com | http://www.stat.purdue.edu/~sguha
The way of the world is to praise dead saints and prosecute live ones.
		-- Nathaniel Howe

View raw message