hadoop-mapreduce-user mailing list archives

From Jan Lukavský <jan.lukav...@firma.seznam.cz>
Subject M/R API and Writable semantics in reducer
Date Mon, 02 Sep 2013 13:29:40 GMT
Hi all,

Some time ago I wrote to this list that it would be nice to be able to
get the *real* key emitted from the mapper in the reducer when using a
GroupingComparator. The answer I got was that this is already possible
because of the Writable semantics, and that the following currently
holds:

@Override
protected void reduce(Key key, Iterable<Value> values, Context context)
{
   for (Value v : values) {
     // The key MIGHT change its value in this cycle, because
     // readFields() will be called on it.
     // When using a GroupingComparator that groups only by some part
     // of the key, many different keys might be considered a single
     // group, so the *real* data matters.
   }
}

When you use a GroupingComparator, the contents of the key can matter:
if you cannot access them, you have to duplicate the data in the value
(which means more network traffic in the shuffle phase, and more I/O
generally).
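
To make the behaviour described above concrete, here is a minimal plain-Java sketch that simulates it without any Hadoop classes. Everything in it is an assumption made for illustration: CompositeKey, ShuffleRecord and valuesOf() are hypothetical names, and the overwrite inside next() only stands in for what readFields() is said to do to the key. It is not Hadoop's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Plain-Java simulation (no Hadoop classes) of key reuse while iterating values. */
class KeyReuseSketch {

    /** Hypothetical composite key: grouping looks only at `group`, not at `order`. */
    static final class CompositeKey {
        String group;
        int order;

        void set(String group, int order) {
            this.group = group;
            this.order = order;
        }
    }

    /** One (key, value) record of a group, as it would arrive after the shuffle. */
    static final class ShuffleRecord {
        final String group;
        final int order;
        final String value;

        ShuffleRecord(String group, int order, String value) {
            this.group = group;
            this.order = order;
            this.value = value;
        }
    }

    /** Values view that overwrites the shared key on every step (stand-in for readFields()). */
    static Iterable<String> valuesOf(final List<ShuffleRecord> records, final CompositeKey sharedKey) {
        return () -> new Iterator<String>() {
            private int pos = 0;

            @Override public boolean hasNext() { return pos < records.size(); }

            @Override public String next() {
                if (!hasNext()) throw new NoSuchElementException();
                ShuffleRecord r = records.get(pos++);
                sharedKey.set(r.group, r.order); // the "real" key belonging to this value
                return r.value;
            }
        };
    }

    /** Reducer-style loop: records the key state observed next to each value. */
    static List<String> reduce(CompositeKey key, Iterable<String> values) {
        List<String> seen = new ArrayList<>();
        for (String v : values) {
            seen.add(key.group + ":" + key.order + "=" + v);
        }
        return seen;
    }

    /** Three records that a grouping comparator on `group` would treat as one group. */
    static List<String> demo() {
        List<ShuffleRecord> records = Arrays.asList(
            new ShuffleRecord("a", 1, "x"),
            new ShuffleRecord("a", 2, "y"),
            new ShuffleRecord("a", 3, "z"));
        CompositeKey key = new CompositeKey();
        key.set("a", 1); // state before the first value is read
        return reduce(key, valuesOf(records, key));
    }
}
```

In this simulation the loop sees a different `order` next to each value even though it was handed a single key object. A test harness that passes a fixed key and a plain list of values would show the same key for every value instead.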

Now the question is to what extent this is a reliable part of the API,
and how likely it is that relying on it will break in future versions.
To me it looks more like a side effect that is not guaranteed to be
maintained. There is already a hint that it is fragile: MRUnit does not
seem to update the key during the iteration.

Does anyone have a suggested workaround? Is the 'official' preferred
way of accessing the original key to call context.getCurrentKey()? Isn't
that the same case? Wouldn't it be nice if the API itself gave some
guarantees, or at least hints, about how this works? I can imagine a
modified reduce() method with a signature like

protected void reduce(Key key, Iterable<Pair<Key, Value>> keyValues,
                      Context context);

This seems easy to map onto the old call (which could be the default
implementation of this method).
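
A rough plain-Java sketch of that idea, again without Hadoop classes and with hypothetical names throughout: PairReducer is an invented base class, Map.Entry stands in for Pair, and a List<String> sink stands in for Context. The default reducePairs() just strips the per-value keys and delegates to the old-style reduce().

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Plain-Java sketch (no Hadoop classes) of the proposed pair-based reduce(). */
class PairReducerSketch {

    /** Hypothetical base class; `sink` stands in for Context, Map.Entry for Pair. */
    static class PairReducer<K, V> {

        /** Old-style callback: group key plus values only. */
        protected void reduce(K key, Iterable<V> values, List<String> sink) {
            // no-op by default
        }

        /** Proposed callback; by default it drops the per-value keys and delegates. */
        protected void reducePairs(K key, final Iterable<Map.Entry<K, V>> keyValues,
                                   List<String> sink) {
            Iterable<V> valuesOnly = () -> {
                final Iterator<Map.Entry<K, V>> it = keyValues.iterator();
                return new Iterator<V>() {
                    @Override public boolean hasNext() { return it.hasNext(); }
                    @Override public V next() { return it.next().getValue(); }
                };
            };
            reduce(key, valuesOnly, sink);
        }
    }

    /** One group: full keys "a#1" and "a#2" are grouped under "a". */
    static List<Map.Entry<String, String>> group() {
        List<Map.Entry<String, String>> g = new ArrayList<>();
        g.add(new SimpleImmutableEntry<>("a#1", "x"));
        g.add(new SimpleImmutableEntry<>("a#2", "y"));
        return g;
    }

    /** Existing reducer that only overrides the old method keeps working unchanged. */
    static List<String> legacyDemo() {
        List<String> out = new ArrayList<>();
        new PairReducer<String, String>() {
            @Override protected void reduce(String key, Iterable<String> values, List<String> sink) {
                for (String v : values) sink.add(key + "->" + v);
            }
        }.reducePairs("a", group(), out);
        return out;
    }

    /** New-style reducer sees the real key next to each value. */
    static List<String> pairDemo() {
        List<String> out = new ArrayList<>();
        new PairReducer<String, String>() {
            @Override protected void reducePairs(String key,
                    Iterable<Map.Entry<String, String>> keyValues, List<String> sink) {
                for (Map.Entry<String, String> kv : keyValues) {
                    sink.add(kv.getKey() + "->" + kv.getValue());
                }
            }
        }.reducePairs("a", group(), out);
        return out;
    }
}
```

One detail the sketch had to work around: the two callbacks carry different names because Iterable<Value> and Iterable<Pair<Key, Value>> erase to the same type, so Java would reject both as overloads of reduce() in one class.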

Any opinion on this?

Thanks,
  Jan
