hadoop-mapreduce-user mailing list archives

From Owen O'Malley <omal...@apache.org>
Subject Re: Combiner Problem
Date Mon, 06 Jul 2009 16:24:03 GMT

On Jul 5, 2009, at 11:34 PM, Mu Qiao wrote:

> There is a property min.num.spills.for.combine specifying the  
> minimum number of spills required to run the combiner when merging.  
> The default value is 3. Why is there such a restriction? Wouldn't it  
> be better to run the combiner no matter how many spills there are?

Clearly the combiner isn't useful if there is only 1 spill, and 3 is a  
guess about how many spills are necessary before the cost of applying  
the combiner is paid for by the resulting compression. Feel free to set  
it to 2.
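Owen's suggestion amounts to a one-line configuration change. A minimal sketch, assuming the property is set in the job's configuration file in the usual Hadoop XML convention (the property name comes from the question above; the description text here is illustrative, not from the Hadoop source):

```xml
<property>
  <name>min.num.spills.for.combine</name>
  <value>2</value>
  <description>Run the combiner during spill merging once at least
  this many spill files exist (default 3).</description>
</property>
```

The same value can also be set programmatically on the job's Configuration object before submission.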

> The second question is why the combiner can be run on the reduce  
> side. Can't the reduce function take the place of that?

The combiners are called on the reduce side only if there are enough  
spills that more than a single merge pass is required before the data  
can go to the reduce. (The reduce is only called once, at the end.) So  
if the reduce has 1000 streams to merge, it will use the combiner on  
the intermediate merges before they are written to disk.
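To make the multi-pass behaviour concrete, here is a toy sketch in plain Java (NOT Hadoop's actual merge code): sorted spill streams of (key, count) pairs are merged `factor` at a time, a summing combiner is applied to every intermediate merge output, and only the final merged run is handed to the reduce.

```java
import java.util.*;

// Toy model of a multi-pass merge with a combiner (not Hadoop's real
// implementation): when more sorted streams exist than the merge factor
// allows, intermediate merges are written out, and the combiner (here,
// summing counts per key) is applied to each intermediate output.
public class MergeSketch {

    // "Combiner": collapse a sorted run of (key, count) pairs by summing
    // the counts of adjacent equal keys.
    static List<Map.Entry<String, Integer>> combine(List<Map.Entry<String, Integer>> run) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : run) {
            int last = out.size() - 1;
            if (last >= 0 && out.get(last).getKey().equals(e.getKey())) {
                out.set(last, Map.entry(e.getKey(), out.get(last).getValue() + e.getValue()));
            } else {
                out.add(e);
            }
        }
        return out;
    }

    // Merge sorted runs by concatenating and re-sorting (fine for a sketch).
    static List<Map.Entry<String, Integer>> mergeRuns(List<List<Map.Entry<String, Integer>>> runs) {
        List<Map.Entry<String, Integer>> merged = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> r : runs) merged.addAll(r);
        merged.sort(Map.Entry.comparingByKey());
        return merged;
    }

    // Multi-pass merge: take up to `factor` streams at a time; every
    // intermediate (non-final) merge output goes through the combiner.
    // The final merge feeds the reduce directly, so it is not combined.
    static List<Map.Entry<String, Integer>> multiPassMerge(
            List<List<Map.Entry<String, Integer>>> streams, int factor) {
        while (streams.size() > factor) {
            List<List<Map.Entry<String, Integer>>> next = new ArrayList<>();
            for (int i = 0; i < streams.size(); i += factor) {
                List<List<Map.Entry<String, Integer>>> batch =
                        streams.subList(i, Math.min(i + factor, streams.size()));
                next.add(combine(mergeRuns(batch))); // combiner on intermediate output
            }
            streams = next;
        }
        return mergeRuns(streams);
    }

    public static void main(String[] args) {
        // Four spill streams, each already sorted: [a:1, b:1].
        List<List<Map.Entry<String, Integer>>> spills = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            spills.add(List.of(Map.entry("a", 1), Map.entry("b", 1)));
        }
        // With factor 2, one intermediate pass combines duplicates before
        // the final merge hands the result to the "reduce".
        System.out.println(multiPassMerge(spills, 2));
    }
}
```

With a single merge pass (few spill streams), `multiPassMerge` never enters the loop and the combiner is never invoked on the reduce side, matching the behaviour described above.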

-- Owen
