hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Aaron Kimball <aa...@cloudera.com>
Subject Re: how to do pairwise reduce in multiple steps?!
Date Mon, 25 May 2009 23:46:32 GMT

There is not a "direct" way to set up an arbitrary-depth chain of reduces as
part of the same job. You have two basic options:

1) Do a level of mapping followed by a single level of your tree reduce.
Then follow up with an IdentityMapper forwarding your data to a second level
of your tree reduce. Repeat this part until you're done.

2) Leverage the sorted nature of keys being sent into your reducer to
topographically sort your results before they are reduced up pairwise. The
reducer task which received a set of keys receives all keys that return the
same partition number from the Partitioner. The default partitioner returns
a partition id of key.hashCode() % numReduceTasks. So by playing with the
key's hashCode() method, you could force elements of the same subtree to go
to the same reducer. The reducer will then crunch a set of keys and their
related values, in sorted order by the compareTo() method of the keys.

e.g., for the example below, you could make sure that x, y, z, and w all had
the same key.hashCode() so they arrive at the same reducer; then have their
compareTo() methods sort them in order: x, y, z, w. You can then compute
f(x, y) and store that in memory somewhere (call it 'a'), then compute f(z,
w) and store that in 'b'. Then since you know that you computed a full level
of the tree, you'd go back through your second-level array in the same
reduce task, computing f(a, b), and emitting that to the output collector.
You could also emit sentinel-values from the mapper to your reducers to
denote the end of a level of the tree, so that you don't have to count in
the reducer itself.

The amount of parallellism you can achieve is limited roughly by the amount
of memory you'll need on a task node to buffer the intermediate values.

After each reducer computes an aggregate over a subtree, you'll have only as
many results as you have reduce tasks -- maybe 64 or 128 or so -- at which
point the final aggregation can be done single-threaded in the main program.

It's a bit of a stretch, but you could probably flex MapReduce into doing
this for you.

- Aaron

On Mon, May 25, 2009 at 12:19 AM, Tim Kiefer <tim-kiefer@gmx.de> wrote:

> Hi everyone,
> I do have an application that requires to combine/reduce results in pairs -
> and the results of this step again in pairs. The binary way ;). In case it
> is not clear what is meant, have a look at the schema:
> Assume to have 4 inputs (in my case from an earlier MR job). Now the first
> two inputs need to be combined and the second two. The results need to be
> combined again.
> x ---
>    |- x*y ---
> y ---          |
>              |- x*y*z*w
> z ---          |
>    |- z*w ---
> w ---
> Note that building x*y from x and y is an expensive operation which is why
> I want to use 2 nodes to combine the 4 inputs in the first step. It is not
> an option to collect all inputs and do the combinations on 1 node.
> I guess I basically want to know whether there is a way to have logN reduce
> steps in a row to combine all inputs to just one final output?!
> Any suggestions are appreciated :)
> - Tim

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message