hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Douglas <chri...@yahoo-inc.com>
Subject Re: Combiner run specification and questions
Date Tue, 06 Jan 2009 23:32:47 GMT
> I agree with the requirement that the key does not change. Of  
> course, the values can change.

Yes. Key attributes not relevant to ordering are also mutable. If your  
key is a (t1, t2) tuple ordered by t1, then t2 can change without  
affecting the expected semantics.

> I am primarily worried that the combiner might not be run at all - I  
> have 'successfully' integrated Hadoop and R i.e the user can provide  
> map/reduce functions written in R. However, R is not great with  
> memory management, and if I have N (N is huge) values for a given  
> key K, then R will baulk when it comes to processing this.

That sounds really cool. The cases where the combiner is not run are  
currently obscure and future exceptions are unlikely to affect this  
particular case. For example, it doesn't make sense to run the  
combiner for a single record (i.e. there is nothing to combine), for a  
partition with no common keys, etc. So as long as the correctness of  
the computation doesn't rely on a transformation performed in the  
combiner, it should be OK. In general, it shouldn't be- and in the  
future may not be- run where there is little or no compression for it  
to effect, which doesn't exacerbate this case.

However, this restriction limits the scalability of your solution. It  
might be necessary to work around R's limitations by breaking up large  
computations into intermediate steps, possibly by explicitly  
instantiating and running the combiner in the reduce.

> 1) I am guaranteed a reducer.
> So,
>> The combiner, if defined, will run zero or more times on records  
>> emitted from the map, before being fed to the reduce.
> This zero case possibility worries me. However you mention, that it  
> occurs
>> collector spills in the map
> I have noticed this happening - what 'spilling' mean?

Records emitted from the map are serialized into a buffer, which is  
periodically written to disk when it is (sufficiently) full. Each of  
these batch writes is a "spill". In casual usage, it refers to any  
time when records need to be written to disk. The merge of  
intermediate files into the final map output and merging in-memory  
segments to disk in the reduce are two examples. -C

> Thank you
> Saptarshi
> On Jan 5, 2009, at 10:22 PM, Chris Douglas wrote:
>> The combiner, if defined, will run zero or more times on records  
>> emitted from the map, before being fed to the reduce. It is run  
>> when the collector spills in the map and in some merge cases. If  
>> the combiner transforms the key, it is illegal to change its type,  
>> the partition to which it is assigned, or its ordering.
>> For example, if you emit a record (k,v) from your map and (k',v)  
>> from the combiner, your comparator is C(K,K) and your partitioner  
>> function is P(K), it must be the case that P(k) == P(k') and  
>> C(k,k') == 0. If either of these does not hold, the semantics to  
>> the reduce are broken. Clearly, if k is not transformed (as in true  
>> for most combiners), this holds trivially.
>> As was mentioned earlier, the purpose of the combiner is to  
>> compress data pulled across the network and spilled to disk. It  
>> should not affect the correctness or, in most cases, the output of  
>> the job. -C
>> On Jan 2, 2009, at 9:57 AM, Saptarshi Guha wrote:
>>> Hello,
>>> I would just like to confirm, when does the Combiner run(since it
>>> might not be run at all,see below). I read somewhere that it is run,
>>> if there is at least one reduce (which in my case i can be sure of).
>>> I also read, that the combiner is an optimization. However, it is  
>>> also
>>> a chance for a function to transform the key/value (keeping the  
>>> class
>>> the same i.e the combiner semantics are not changed) and deal with a
>>> smaller set ( this could be done in the reducer but the number of
>>> values for a key might be relatively large).
>>> However, I guess it would be a mistake for reducer to expect its  
>>> input
>>> coming from a combiner? E.g if there are only 10 value corresponding
>>> to a key(as outputted by the mapper), will these 10 values go  
>>> straight
>>> to the reducer or to the reducer via the combiner?
>>> Here I am assuming my reduce operations does not need all the values
>>> for a key to work(so that a combiner can be used) i.e additive
>>> operations.
>>> Thank you
>>> Saptarshi
>>> On Sun, Nov 16, 2008 at 6:18 PM, Owen O'Malley  
>>> <omalley@apache.org> wrote:
>>>> The Combiner may be called 0, 1, or many times on each key  
>>>> between the
>>>> mapper and reducer. Combiners are just an application specific  
>>>> optimization
>>>> that compress the intermediate output. They should not have side  
>>>> effects or
>>>> transform the types. Unfortunately, since there isn't a separate  
>>>> interface
>>>> for Combiners, there is isn't a great place to document this  
>>>> requirement.
>>>> I've just filed HADOOP-4668 to improve the documentation.
>>> -- 
>>> Saptarshi Guha - saptarshi.guha@gmail.com
> Saptarshi Guha | saptarshi.guha@gmail.com | http://www.stat.purdue.edu/~sguha
> The way of the world is to praise dead saints and prosecute live ones.
> 		-- Nathaniel Howe

View raw message