hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From John M Cieslewicz <jmcie...@us.ibm.com>
Subject Re: A Case for Stronger Partial Aggregation Semantics in Hadoop
Date Mon, 30 Jul 2007 22:33:28 GMT
"Owen O'Malley" <oom@yahoo-inc.com> wrote on 07/30/2007 02:31:04 PM:

>
> On Jul 26, 2007, at 5:02 PM, John M Cieslewicz wrote:
> > The combiner semantics, however, are the same as the reducer’s and
> > there is nothing to prevent a programmer from implementing a
> > combiner that
> > changes the value of the key or outputs more or less than one key-
> > value
> > pair.
>
> The combiner and reducer share an interface. However, the semantics
> are different. In particular,
>    1. Combiners may be invoked once or many times on each of the map
> outputs, while reduces will be invoked exactly once on each key.
>    2. As a result of that, combiners effectively can not have side
> effects, while reduces can.
>    3. Reduces can emit different types than their inputs, combiners
> can not.
>    4. Reduces can change the key, while combiners are required not
> to. Currently this is not checked dynamically, although it should be.
> (Things will break badly if combiners do this...)

Rather than checking this dynamically, I think it would be easier and
clearer for the combiner programmer to define the combiner interface along
the lines suggested by Doug:
public interface Combiner {
   /** Combine all values passed into a single value that is returned. */
   public Writable combine(WritableComparable key, Iterator values);
}:
If such a change seems reasonable, I would be happy to implement the
necessary changes.

>
> Note that currently Hadoop invokes the combiner exactly once. There
> is a jira issue filed to fix that. *smile*
>

Could you point me to the jira issue related to this? I have already
implemented some things that are potentially related to this such as
combining across map spills during the merge at a completion of a map task.

> > This leads to a number of limitations, chief among them the fact
> > that the
> > combiner cannot be applied more than once because there are no
> > guarantees
> > regarding the effects of repeatedly using the combiner (as
> > implemented, the
> > combiner could produce more than one output pair or change the key).
>
> As I said in the previous point, the combiner can be invoked more
> than once and should be. It currently does not. Applications are
> required to keep the combiners pure. I hope it does not break too
> many applications when we fix this.
>
> > A summary of desirable semantics:
> >    1 The map function produces as output partial aggregate values
> >       representing singletons.
> >    2 A new combiner function that explicitly performs partial to
> > partial
> >       aggregation over one or more values, creating one new output
> > value of
> >       the same type as the input value and not changing the key.
> >    3 A reducer which takes as input partial aggregates and produces
> > final
> >       values of any format.
>
> Basically, we already have this, except that we allow the combiner to
> emit multiple records. Multiple records out of the combiner is not as
> clean, but in practice I don't think it hurts anything.
>
A potential problem caused by allowing a combiner to output multiple
records is that it could break future optimizations. With a combiner
defined, one could, for instance, pipeline some combining within the
reducer using a means other than sorting such as a tree or hash table. In
those cases, one might require a combiner to produce a single new value for
the given key.

-John
Mime
View raw message