hadoop-common-dev mailing list archives

From "Runping Qi" <runp...@yahoo-inc.com>
Subject RE: [jira] Commented: (HADOOP-115) Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.
Date Sun, 02 Apr 2006 23:27:59 GMT

The argument for using local combiners is interesting. To me, the combiner
class is just another layer of transformer. It does not mean that the combiner
class has to be the same as the reducer class. The only criterion is that
they satisfy the associativity rule:
	Let L1, L2, ..., Ln and K1, K2, ..., Km be two partitions of S. Then
	Reduce(list(Combiner(L1), Combiner(L2), ..., Combiner(Ln))) and
	Reduce(list(Combiner(K1), Combiner(K2), ..., Combiner(Km))) are the
same.

A special (and perhaps very common) scenario is that the combiner and reducer
are the same class and the reduce function is associative. However, this need
not be the case in general. And if the combiner and the reducer are not the
same class, the class of the reduce outputs need not be the same as that of
the combiner outputs.
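To make the point concrete, here is a hypothetical plain-Java sketch (not the actual Hadoop API; all class and method names are invented for illustration) of computing a mean, where the combiner and reducer are different classes with different output types, yet the associativity rule above still holds:

```java
import java.util.List;

// Hypothetical illustration, not the Hadoop API: the combiner emits
// (sum, count) partial aggregates; the reducer divides at the end.
// Any way of partitioning the input values across combiner calls
// yields the same final mean.
class MeanCombiner {
    // Combines one partition of raw values into a (sum, count) pair.
    static long[] combine(List<Long> partition) {
        long sum = 0;
        for (long v : partition) sum += v;
        return new long[] { sum, partition.size() };
    }
}

class MeanReducer {
    // Reduces a list of (sum, count) pairs to the overall mean.
    static double reduce(List<long[]> partials) {
        long sum = 0, count = 0;
        for (long[] p : partials) { sum += p[0]; count += p[1]; }
        return (double) sum / count;
    }
}
```

Note that the combiner's output class (a sum/count pair) differs from both the raw input class and the reducer's output class (a double), which a reduce-equals-combine design could not express.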
 

Runping

-----Original Message-----
From: Doug Cutting [mailto:cutting@apache.org] 
Sent: Sunday, April 02, 2006 1:30 PM
To: hadoop-dev@lucene.apache.org
Subject: Re: [jira] Commented: (HADOOP-115) Hadoop should allow the user to
use SequentialFileOutputformat as the output format and to choose key/value
classes that are different from those for map output.

Eric Baldeschwieler wrote:
> I cannot think of a case where this proposed extension complicates
> code or reduces comprehensibility.  Since it is backwards compatible with
> your desired API, purists can simply ignore the option.

It makes the insertion of a combiner no longer transparent.  The reducer 
would have to know whether a combiner had been used in order to know how 
to process the map output.

In general this seems like a micro-optimization.  It saves little code: 
instead of writing 'collector.collect(key, new List(value))' one could 
write 'collector.collect(key, value)'.

Taking this to its logical extreme, in the classic word-count use of 
MapReduce, why should one have to emit ones for the map values?  Why 
have a value at all?  Why not add a collect(key) method, then permit 
reducers to be passed an iterator which returns null for all values 
where collect(key) was called.  That would save a little code and make 
the intermediate data a bit smaller.  So should we do it?  I'd argue not.
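For reference, the word-count pattern Doug alludes to can be sketched in plain Java (hypothetical names, not the Hadoop API): the map step emits an explicit 1 for every word, and the reduce step sums those ones per key.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of classic word count (illustrative names only,
// not the Hadoop API).
class WordCount {
    // "Map": emit an explicit (word, 1) pair for every word in the line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            pairs.add(Map.entry(word, 1));
        }
        return pairs;
    }

    // "Reduce": sum the emitted ones for each distinct word.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }
}
```

Doug's hypothetical collect(key) would drop the explicit 1 from the emitted pairs; his point is that the small saving is not worth the extra API surface.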

Doug


