hadoop-common-dev mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: [jira] Commented: (HADOOP-115) Hadoop should allow the user to use SequentialFileOutputformat as the output format and to choose key/value classes that are different from those for map output.
Date Sun, 02 Apr 2006 20:29:54 GMT
Eric Baldeschwieler wrote:
> I can not think of a case where this proposed extension complicates
> code or reduces compressibility.  Since it is backwards compatible with
> your desired API, purists can simply ignore the option.

It makes the insertion of a combiner no longer transparent.  The reducer 
would have to know whether a combiner had been used in order to know how 
to process the map output.
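
To illustrate what "transparent" means here, the following is a minimal sketch in plain Java, not Hadoop code. Every name in it (CombinerTransparencySketch, group, sum, combine, run, sampleTasks) is hypothetical. It assumes a combiner that emits the same key/value types as the map output, so the reducer gives the same answer whether or not the combiner ran.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a MapReduce job; none of these names are Hadoop APIs.
class CombinerTransparencySketch {

    static Map.Entry<String, Integer> pair(String key, int value) {
        return new AbstractMap.SimpleImmutableEntry<>(key, value);
    }

    // Groups (key, value) pairs by key, roughly what the shuffle does.
    static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Sums the values of each key. Serves as the reduce function.
    static Map<String, Integer> sum(Map<String, List<Integer>> grouped) {
        Map<String, Integer> out = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int total = 0;
            for (int v : e.getValue()) total += v;
            out.put(e.getKey(), total);
        }
        return out;
    }

    // Combiner: applies the reduce function to one map task's output and
    // re-emits pairs of the SAME types as the map output.
    static List<Map.Entry<String, Integer>> combine(List<Map.Entry<String, Integer>> pairs) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : sum(group(pairs)).entrySet()) {
            out.add(pair(e.getKey(), e.getValue()));
        }
        return out;
    }

    // Runs the job over the outputs of several map tasks, with or without a combiner.
    static Map<String, Integer> run(List<List<Map.Entry<String, Integer>>> mapTasks,
                                    boolean useCombiner) {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> taskOutput : mapTasks) {
            shuffled.addAll(useCombiner ? combine(taskOutput) : taskOutput);
        }
        return sum(group(shuffled)); // the reducer never asks whether a combiner ran
    }

    // Illustrative data: the outputs of two map tasks.
    static List<List<Map.Entry<String, Integer>>> sampleTasks() {
        return Arrays.asList(
            Arrays.asList(pair("a", 1), pair("b", 1), pair("a", 1)),
            Arrays.asList(pair("a", 1), pair("c", 1)));
    }
}
```

If the map output type were allowed to differ from what the combiner emits, run() could no longer feed both branches into the same sum(group(...)) call, and the reducer would need a flag or a type check to tell them apart.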

In general this seems like a micro-optimization.  It saves little code.
Instead of writing 'collector.collect(key, new List(value))' one could
write 'collector.collect(key, value)'.

Taking this to its logical extreme, in the classic word-count use of
MapReduce, why should one have to emit ones for the map values?  Why
have a value at all?  Why not add a collect(key) method, then permit
reducers to be passed an iterator that returns null for every value
where collect(key) was called?  That would save a little code and make
the intermediate data a bit smaller.  So should we do it?  I'd argue not.
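
For reference, the classic word count with explicit ones looks roughly like the sketch below. It is plain Java rather than Hadoop code, and the Collector interface and the WordCountSketch class are hypothetical names made up for the illustration. The point is only that reduce sums whatever integers it receives, so it does not care whether they are raw ones or already-combined partial counts.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical word-count sketch; Collector is illustrative, not Hadoop's interface.
class WordCountSketch {

    interface Collector {
        void collect(String key, int value);
    }

    // Map: emit (word, 1) for every whitespace-separated token.
    static void map(String line, Collector out) {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) out.collect(word, 1);
        }
    }

    // Reduce: sum the values, whether they are ones or partial counts.
    static int reduce(Iterator<Integer> values) {
        int total = 0;
        while (values.hasNext()) total += values.next();
        return total;
    }

    // Drives map, an in-memory grouping step, and reduce over some input lines.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        Collector collector = (key, value) ->
            grouped.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        for (String line : lines) map(line, collector);

        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            counts.put(e.getKey(), reduce(e.getValue().iterator()));
        }
        return counts;
    }
}
```

With a key-only collect(key), reduce would instead have to count nulls, which is a different code path from summing partial counts.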

Doug
