hadoop-hdfs-user mailing list archives

From Arun C Murthy <...@hortonworks.com>
Subject Re: About the combiner execution
Date Sun, 10 Jul 2011 20:37:34 GMT
(Moving to mapreduce-user@, bcc hdfs-user@. Please use appropriate project lists - thanks)

On Jul 10, 2011, at 4:42 AM, Florin P wrote:

> Hello!
>  I've read on http://www.fromdev.com/2010/12/interview-questions-hadoop-mapreduce.html (cite):
> "The execution of combiner is not guaranteed, Hadoop may or may not execute a combiner.
> Also, if required it may execute it more then 1 times. Therefore your MapReduce jobs should
> not depend on the combiners execution. "
> Is it true? 

Right. The way to visualize it is that the MR framework in the map task collects the 'raw' (i.e.
serialized) map-output key-values in the 'sort' buffer. When the buffer is full, it runs the
combiner (if one is configured) and then spills the result to disk; this applies to the last
(final) spill as well. The combiner is also run when multiple spills on disk need to be merged.

However, combiner execution also depends on having a sufficient number of records to combine.
This is because running the combiner is somewhat expensive: it needs an extra
serialize-deserialize pair.

Thus, the combiner may be run 0 or more times.
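
To make the "0 or more times" point concrete, here is a rough sketch in plain Java (no Hadoop dependency). It is only a toy model of the spill/merge flow described above, not the framework's actual code: the class SpillCombineSketch, the methods sumByKey and runJob, and the bufferSize, combineOnSpill and combineOnMerge parameters are all made up for illustration. It assumes a word-count style job whose combiner is the same associative, commutative sum as the reducer.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SpillCombineSketch {

    // Sums values per key. Associative and commutative, so applying it any
    // number of times is safe. Used here as both combiner and reducer.
    static List<Map.Entry<String, Integer>> sumByKey(List<Map.Entry<String, Integer>> records) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> r : records) {
            sums.merge(r.getKey(), r.getValue(), Integer::sum);
        }
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : sums.entrySet()) {
            out.add(new SimpleEntry<>(e.getKey(), e.getValue()));
        }
        return out;
    }

    // Toy map task plus reduce: fill a buffer of bufferSize records, optionally
    // combine and spill; merge all spills (optionally combining again); reduce.
    static Map<String, Integer> runJob(String[] words, int bufferSize,
                                       boolean combineOnSpill, boolean combineOnMerge) {
        List<List<Map.Entry<String, Integer>>> spills = new ArrayList<>();
        List<Map.Entry<String, Integer>> buffer = new ArrayList<>();
        for (String w : words) {
            buffer.add(new SimpleEntry<>(w, 1));
            if (buffer.size() == bufferSize) {
                spills.add(combineOnSpill ? sumByKey(buffer) : buffer);
                buffer = new ArrayList<>();
            }
        }
        if (!buffer.isEmpty()) {
            // The final, partially filled spill is treated like any other.
            spills.add(combineOnSpill ? sumByKey(buffer) : buffer);
        }

        List<Map.Entry<String, Integer>> merged = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> spill : spills) {
            merged.addAll(spill);
        }
        if (combineOnMerge) {
            merged = sumByKey(merged);
        }

        // The reduce step always runs, whatever the combiner did or did not do.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, Integer> e : sumByKey(merged)) {
            result.put(e.getKey(), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        String[] words = {"a", "b", "a", "c", "a", "b", "c"};
        System.out.println(runJob(words, 3, false, false)); // combiner never runs
        System.out.println(runJob(words, 3, true, false));  // combiner runs once per spill
        System.out.println(runJob(words, 3, true, true));   // combiner runs on spills and on merge
    }
}
```

All three calls print {a=3, b=2, c=2}. That is the property a job needs: the reduce output must not change whether the combiner runs zero, one, or several times.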

> Also, is it possible to use the Combiner without the Reducer? Will the framework take
> the Combiner into consideration in this case?


No. When the job has no reduces, the map outputs are written straight to HDFS (typically) without
being sorted. Thus, combiners are never in that execution path.
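
As a rough illustration of that execution path, here is another toy model in plain Java. It is a sketch under stated assumptions, not framework code: MapOnlySketch, combine, mapTaskOutput and the numReduces parameter are hypothetical names, and the "combiner" simply collapses duplicate records. In a real job the number of reduces would be set on the job itself, e.g. with Job.setNumReduceTasks(0).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class MapOnlySketch {

    // Counts how often the (hypothetical) combiner is invoked.
    static int combinerCalls = 0;

    // Hypothetical combiner: collapses adjacent duplicates in sorted input.
    static List<String> combine(List<String> sorted) {
        combinerCalls++;
        List<String> out = new ArrayList<>();
        for (String s : sorted) {
            if (out.isEmpty() || !out.get(out.size() - 1).equals(s)) {
                out.add(s);
            }
        }
        return out;
    }

    // Toy map task: with zero reduces the records are written out as-is;
    // otherwise they are sorted and passed through the combiner.
    static List<String> mapTaskOutput(List<String> mapOutput, int numReduces) {
        if (numReduces == 0) {
            return new ArrayList<>(mapOutput); // no sort, no combiner
        }
        List<String> sorted = new ArrayList<>(mapOutput);
        Collections.sort(sorted);
        return combine(sorted);
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("b", "a", "b");
        System.out.println(mapTaskOutput(records, 0)); // [b, a, b]
        System.out.println(combinerCalls);             // 0
        System.out.println(mapTaskOutput(records, 1)); // [a, b]
        System.out.println(combinerCalls);             // 1
    }
}
```

With zero reduces the records come out unsorted and uncombined, and the combiner counter stays at 0, which mirrors the point above: a combiner configured on a map-only job is simply never reached.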

hth,
Arun