hadoop-mapreduce-user mailing list archives

From Amogh Vasekar <am...@yahoo-inc.com>
Subject Re: chained mappers & reducers
Date Thu, 21 Jan 2010 12:16:40 GMT
Unless you can somehow guarantee that a certain output key K2 comes only from reducer R1
(which seems very unlikely, and somewhat useless in your case), I'm afraid you'll need a
subsequent MR job. The thing is, Hadoop has no in-built mechanism for reducers to exchange
data :)
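
For concreteness, a subsequent job is typically driven like this. This is only a hypothetical sketch using the old `org.apache.hadoop.mapred` API; `Driver`, `M1`, `R1`, `R2`, and the `in`/`tmp`/`out` paths are stand-ins for your own classes and locations, not names from this thread:

```java
// Hypothetical two-job driver (old mapred API); all class names and
// paths are placeholders.
JobConf job1 = new JobConf(Driver.class);
job1.setMapperClass(M1.class);
job1.setReducerClass(R1.class);
FileInputFormat.setInputPaths(job1, new Path("in"));
FileOutputFormat.setOutputPath(job1, new Path("tmp"));
JobClient.runJob(job1);              // blocks until job 1 finishes

JobConf job2 = new JobConf(Driver.class);
job2.setMapperClass(IdentityMapper.class);   // pass-through map phase
job2.setReducerClass(R2.class);
FileInputFormat.setInputPaths(job2, new Path("tmp"));
FileOutputFormat.setOutputPath(job2, new Path("out"));
JobClient.runJob(job2);              // the second sort-and-shuffle happens here
```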


On 1/21/10 12:30 AM, "Clements, Michael" <Michael.Clements@disney.com> wrote:

The use case is this: M1-R1-R2

M1: generate K1-V1 pairs from input
R1: group by K1; generate new keys K2 from each group, each with a count as value V2

M2: identity pass-through
R2: sum counts by K2

In short, R1 does this:
- groups data by the K1 defined by M1
- emits new keys K2, derived from the group it built
- attaches a count to each key K2

R2 sums the counts for each K2

The output of R1 could be fed directly into R2, but I can't find a way to do that in Hadoop.
So I have to create a second job, which must have a Map phase, for which I create a pass-through
mapper. This works, but it has a lot of overhead. It would be faster & cleaner to run R1
directly into R2 within the same job - if possible.
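
For illustration only, the M1-R1-R2 data flow above can be simulated in plain Java (no Hadoop). Everything here is a made-up example, not code from the thread: K1 is a user, V1 a page, K2 a user's distinct-page count, and R2 produces a histogram of users per count:

```java
import java.util.*;

public class TwoStageAggregation {
    // M1 + shuffle: emit (K1=user, V1=page) pairs from lines "user page",
    // grouped by K1. TreeMap keeps the groups in deterministic order.
    static Map<String, List<String>> mapAndShuffle(List<String> lines) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split(" ");
            groups.computeIfAbsent(parts[0], k -> new ArrayList<>()).add(parts[1]);
        }
        return groups;
    }

    // R1: per K1 group, derive K2 (the distinct-page count) with a count of 1.
    static List<Map.Entry<Integer, Integer>> reduce1(Map<String, List<String>> groups) {
        List<Map.Entry<Integer, Integer>> out = new ArrayList<>();
        for (List<String> pages : groups.values()) {
            out.add(Map.entry(new HashSet<>(pages).size(), 1));
        }
        return out;
    }

    // R2: sum counts per K2. In Hadoop this hand-off needs its own
    // sort-and-shuffle, hence the second job.
    static Map<Integer, Integer> reduce2(List<Map.Entry<Integer, Integer>> pairs) {
        Map<Integer, Integer> sums = new TreeMap<>();
        for (Map.Entry<Integer, Integer> e : pairs) {
            sums.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("alice home", "alice search",
                                     "bob home", "carol home", "carol home");
        // alice saw 2 distinct pages; bob and carol saw 1 each.
        System.out.println(reduce2(reduce1(mapAndShuffle(lines)))); // {1=2, 2=1}
    }
}
```

The reduce1-to-reduce2 hand-off is exactly the step that re-groups data by a new key, which is why Hadoop forces it into a second job.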

From: mapreduce-user-return-302-Michael.Clements=disney.com@hadoop.apache.org [mailto:mapreduce-user-return-302-Michael.Clements=disney.com@hadoop.apache.org]
On Behalf Of Amogh Vasekar
Sent: Tuesday, January 19, 2010 10:53 PM
To: mapreduce-user@hadoop.apache.org
Subject: Re: chained mappers & reducers

Can you elaborate on your case a little?
If you need sort and shuffle (i.e. outputs of different reducer tasks of R1 to be aggregated
in some way), you have to write another map-red job. If you need to process only local reducer
data (i.e. your reducer output key is the same as its input key), your job would be M1-R1-M2.
Essentially, in Hadoop you can have only one sort-and-shuffle phase per job.
Note that the chain APIs are for jobs of the form (M+RM*).
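
To make the (M+RM*) shape concrete, here is a hypothetical driver fragment using the old-API chain classes; `Driver`, `M1`, `R1`, `M2`, and the key/value types are placeholders for your own:

```java
// One job, one sort-and-shuffle: M1 -> shuffle -> R1 -> M2 (an M+RM* shape).
JobConf job = new JobConf(Driver.class);

ChainMapper.addMapper(job, M1.class, LongWritable.class, Text.class,
                      Text.class, LongWritable.class, true, new JobConf(false));

// The single reducer; its sort-and-shuffle is the only one in the job.
ChainReducer.setReducer(job, R1.class, Text.class, LongWritable.class,
                        Text.class, LongWritable.class, true, new JobConf(false));

// Further steps can only be mappers chained after the reducer; a second
// reducer would need another shuffle, i.e. another job.
ChainReducer.addMapper(job, M2.class, Text.class, LongWritable.class,
                       Text.class, LongWritable.class, true, new JobConf(false));

JobClient.runJob(job);
```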


On 1/20/10 2:29 AM, "Clements, Michael" <Michael.Clements@disney.com> wrote:
These two classes are not really symmetric, as their names suggest.
ChainMapper does what I expected: it chains multiple map steps. But
ChainReducer does not chain reduce steps; it chains map steps to
follow a single reduce step. At least, that is my understanding given
the API docs & examples I've read.

Is there a way to chain multiple reducer steps? I've got a job that
needs an M1-R1-R2. It currently has two phases: M1-R1 followed by M2-R2,
where M2 is an identity pass-through mapper. If there were a way to
chain 2 reduce steps the way ChainedMapper chains map steps, I could
make this into a one-pass job, eliminating the overhead of a second job
and all the unnecessary I/O.


Michael Clements
Solutions Architect
206 664-4374 office
360 317 5051 mobile
