hadoop-common-user mailing list archives

From Shevek <she...@karmasphere.com>
Subject Re: Grouping in Combiners
Date Mon, 31 Oct 2011 20:37:28 GMT
On 31 October 2011 13:20, Harsh J <harsh@cloudera.com> wrote:

> Shevek,
> The problem Mathias indicates here is that the Combiners do not utilize
> the Grouping Comparators. They only use the Sort Comparators. I wonder
> whether that is a bug.

Hm, that's a rather interesting definitional question.

The pure structure is: ... -> scatter -> gather -> ...

For any sort of structural messing around, we're allowed to do anything we
like with the scatter-gather pair, as long as there are no visible effects
on the input or output. So...

For clusters, it becomes ... -> scatter -> shuffle -> gather -> ... because
"shuffle -> gather" is a distributed gather. Sorting is a secondary issue
which only becomes visible to the reducer after the gather. IIRC, Google
modified their MapReduce to make sorting optional, which made Tenzing a lot
faster, so let's do that:

... -> scatter -> gather [-> sort-within-group] -> ...

Square brackets are optional, as usual. So, the use of the sorting
comparator for the scatter-gather is a mathematical oddity. We are now free
to construct a gather using any mechanism which produces groups, e.g. bins,
etc. (Note that the general case still applies: arbitrary values can't be
gathered faster than by sorting.)
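
To make the "bins" idea concrete, here is a minimal sketch in plain Java
(not Hadoop code) of a gather that produces groups without consulting any
sort comparator. The class name BinGather and the method gather() are made
up for illustration, and it assumes the keys have a usable
equals()/hashCode():

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Hadoop.
class BinGather {
    // Gather scattered (key, value) pairs into groups using hash bins.
    // No ordering between keys is ever consulted.
    static <K, V> Map<K, List<V>> gather(List<Map.Entry<K, V>> scattered) {
        Map<K, List<V>> groups = new LinkedHashMap<>();
        for (Map.Entry<K, V> e : scattered) {
            groups.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                  .add(e.getValue());
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> scattered = List.of(
            Map.entry("b", 1), Map.entry("a", 2), Map.entry("b", 3));
        // Groups come out in first-seen order, not sorted order.
        System.out.println(gather(scattered)); // {b=[1, 3], a=[2]}
    }
}
```

The reducer would still see every value for a key together; only the order
of groups, and of values within a group, is left unspecified.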

The combiner is inserted between scatter and shuffle. If we construct:

... -> scatter -> partial-gather [[-> sort-within-group] -> combine] ->
scatter[-groups] -> gather[-groups] [-> sort-within-groups] -> ...

then perhaps the use of the sorting comparator for combiners really is a
bug. Note that the extra sort-within-group is optional; it is there in case
the combiner needs sorted input.
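
A rough sketch of why the choice of comparator matters to the combiner, in
plain Java rather than against the Hadoop API. Everything here (the Key
class, SORT, GROUP, combine(), the summing combiner) is a hypothetical
stand-in; it assumes a composite key with a natural part and a secondary
sort field, and a combiner that merges adjacent records whenever the
supplied comparator says their keys are equal:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Hadoop.
class CombinerGrouping {
    // Composite key: natural key plus a secondary sort field.
    static final class Key {
        final String natural;
        final int secondary;
        Key(String natural, int secondary) {
            this.natural = natural;
            this.secondary = secondary;
        }
        String natural() { return natural; }
        int secondary() { return secondary; }
    }

    // Sort comparator: orders on the full composite key.
    static final Comparator<Key> SORT =
        Comparator.comparing(Key::natural).thenComparingInt(Key::secondary);
    // Grouping comparator: only looks at the natural key.
    static final Comparator<Key> GROUP = Comparator.comparing(Key::natural);

    // Walk records already ordered by SORT and sum values per group; a new
    // group starts whenever 'grouper' says adjacent keys differ.
    static List<Integer> combine(List<Map.Entry<Key, Integer>> sorted,
                                 Comparator<Key> grouper) {
        List<Integer> sums = new ArrayList<>();
        Key prev = null;
        for (Map.Entry<Key, Integer> e : sorted) {
            if (prev == null || grouper.compare(prev, e.getKey()) != 0) {
                sums.add(e.getValue());
            } else {
                int last = sums.size() - 1;
                sums.set(last, sums.get(last) + e.getValue());
            }
            prev = e.getKey();
        }
        return sums;
    }

    // Three sample records, sorted with the sort comparator.
    static List<Map.Entry<Key, Integer>> sample() {
        List<Map.Entry<Key, Integer>> recs = new ArrayList<>(List.of(
            Map.entry(new Key("a", 2), 10),
            Map.entry(new Key("b", 1), 5),
            Map.entry(new Key("a", 1), 1)));
        recs.sort(Map.Entry.comparingByKey(SORT));
        return recs;
    }

    public static void main(String[] args) {
        System.out.println(combine(sample(), SORT));  // [1, 10, 5]
        System.out.println(combine(sample(), GROUP)); // [11, 5]
    }
}
```

Grouping with the sort comparator leaves every record in its own group, so
the combiner combines nothing; grouping on the natural key alone merges the
two "a" records, which is presumably what Mathias wants.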

If we use the above structural layout of mapreduce, we can also remove the
restriction that the sorting and grouping comparators must agree on the
subset of the key on which they compare, which I _think_ is present
(correct me as usual, please).

Perhaps we should pin a copy of this diatribe on the back of the toilet
door in case anyone else knows the answer, but in conclusion, I think it
_is_ a bug.


> On 31-Oct-2011, at 11:14 PM, Shevek wrote:
> > I like the ability to reuse a Java component for both sorting and grouping,
> > and to be honest, since the cases where one can do a comparison without
> > deserializing the raw bytes are relatively few and far between, I tend to
> > use Java's Comparator interface, and wrap it in some
> > infrastructure-specific adapter. I have a vague feeling that Hadoop
> > sometimes calls the byte interface and sometimes the object interface
> > anyway? ICBW, the way I've been writing code makes it irrelevant.
> >
> > Alternatively, I've misunderstood the (simpler) question, and the answer is
> > to use the setGroupingComparatorClass() API.
> >
> > S.
> >
> > On 29 October 2011 04:35, Mathias Herberts <mathias.herberts@gmail.com> wrote:
> >
> >> Another point concerning the Combiners,
> >>
> >> the grouping is currently done using the RawComparator used for
> >> sorting the Mapper's output. Wouldn't it be useful to be able to set a
> >> custom CombinerGroupingComparatorClass?
> >>
> >> Mathias.
> >>
