From: Amogh Vasekar
To: mapreduce-user@hadoop.apache.org
Date: Thu, 21 Jan 2010 17:46:40 +0530
Subject: Re: chained mappers & reducers

Unless you can somehow guarantee that a certain output key K1 comes only from reducer R1 (which seems very unlikely, and somewhat useless in your case), I'm afraid you'll need a subsequent MR job. The thing is, Hadoop has no "in-built" mechanism for reducers to exchange data :)
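For illustration, a minimal sketch of what that subsequent job could look like with the 0.20 "mapreduce" API, assuming the first (M1-R1) job wrote its (K2, count) pairs as Text/LongWritable in a SequenceFile; SumByK2Job and SumReducer are made-up names for this sketch, not Hadoop classes:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SumByK2Job {

  // R2: sum the per-group counts produced by R1.
  public static class SumReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text k2, Iterable<LongWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable c : counts) sum += c.get();
      ctx.write(k2, new LongWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "sum counts by K2");
    job.setJarByClass(SumByK2Job.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setMapperClass(Mapper.class);        // M2: the base Mapper is already a pass-through
    job.setReducerClass(SumReducer.class);   // R2
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // output dir of the M1-R1 job
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}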

Amogh


On 1/21/10 12:30 AM, "Clements, Michael" <Michael.Clements@disney.com> wrote:

The use case is this: M1-R1-R2
 
M1: generate K1-V1 pairs from input
R1: group by K1, generate new keys K2 from the group, with value V2, a count
M2: identity pass-through
R2: sum counts by K2
 
In short, R1 does this:
groups data by the K1 defined by M1
emits new keys K2, derived from the group it built
each key K2 has a count
 
R2 sums the counts for each K2
 
The output of R1 could be fed directly into R2. But I can't find a way to do that in Hadoop. So I have to create a second job, which has to have a Map phase, so I create a pass-through mapper. This works but it has a lot of overhead. It would be faster & cleaner to run R1 directly into R2 within the same job, if possible.
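Spelled out as code, R1 in this description might look roughly like the sketch below; the String/Text key types and the deriveK2 helper are made-up placeholders, since the actual derivation of K2 from the group isn't shown in the thread:

// Sketch of R1: consumes the (K1, V1) groups produced by M1 and, for each group,
// emits one or more derived keys K2 together with a count.
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class GroupToCountReducer
    extends Reducer<Text, Text, Text, LongWritable> {

  @Override
  protected void reduce(Text k1, Iterable<Text> values, Context ctx)
      throws IOException, InterruptedException {
    // Materialize the group, then derive K2 keys and count occurrences per K2.
    List<String> group = new ArrayList<String>();
    for (Text v : values) group.add(v.toString());

    Map<String, Long> countsByK2 = new HashMap<String, Long>();
    for (String v1 : group) {
      String k2 = deriveK2(k1.toString(), v1, group);
      Long prev = countsByK2.get(k2);
      countsByK2.put(k2, prev == null ? 1L : prev + 1L);
    }
    for (Map.Entry<String, Long> e : countsByK2.entrySet()) {
      ctx.write(new Text(e.getKey()), new LongWritable(e.getValue()));
    }
  }

  // Placeholder: whatever "new keys derived from the group" means goes here.
  private String deriveK2(String k1, String v1, List<String> group) {
    return k1 + ":" + v1;
  }
}

Because the same K2 can come out of many different K1 groups, and those groups land on different R1 tasks, summing by K2 needs another shuffle, which is exactly the point made above.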
 
 

From: mapreduce-user-return-302-Michael.Clements=disney.com@hadoop.apache.org [mailto:mapreduce-user-return-302-Michael.Clements=disney.com@hadoop.apache.org] On Behalf Of Amogh Vasekar
Sent: Tuesday, January 19, 2010 10:53 PM
To: mapreduce-user@hadoop.apache.org
Subject: Re: chained mappers & reducers

Hi,
Can you elaborate on your case a little?
If you need sort and shuffle (i.e. outputs of different reducer tasks of R1 need to be aggregated in some way), you have to write another map-red job. If you need to process only local reducer data (i.e. your reducer output key is the same as its input key), your job would be M1-R1-M2. Essentially, in Hadoop you can have only one sort and shuffle phase per job.
Note that the chain APIs are for jobs of the form (M+ R M*), i.e. one or more mappers, a single reducer, then zero or more mappers.
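For reference, with the old-API chain classes (org.apache.hadoop.mapred.lib.ChainMapper / ChainReducer) an (M+ R M*) job is wired up roughly as below, patterned on their Javadoc example; AMap, BMap, TheReduce and CMap are placeholders for your own Mapper/Reducer implementations, not classes shipped with Hadoop:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;

public class ChainJobDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(new Configuration(), ChainJobDriver.class);
    conf.setJobName("chain (M+ R M*) example");
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // The "M+" part: one or more chained map steps before the shuffle.
    ChainMapper.addMapper(conf, AMap.class,
        LongWritable.class, Text.class, Text.class, Text.class, true, new JobConf(false));
    ChainMapper.addMapper(conf, BMap.class,
        Text.class, Text.class, Text.class, Text.class, true, new JobConf(false));

    // The "R" part: exactly one reduce step per job.
    ChainReducer.setReducer(conf, TheReduce.class,
        Text.class, Text.class, Text.class, Text.class, true, new JobConf(false));

    // The "M*" part: zero or more map steps after the reduce. There is no way to
    // append a second reduce here, which is why M1-R1-R2 needs a second job.
    ChainReducer.addMapper(conf, CMap.class,
        Text.class, Text.class, Text.class, Text.class, true, new JobConf(false));

    JobClient.runJob(conf);
  }
}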

Amogh


On 1/20/10 2:29 AM, "Clements, Michael" <Michael.Clements@disney.com> wrote:
These two classes are not really symmetric as the name suggests.
ChainMapper does what I expected: chains multiple map steps. But
ChainReducer does not chain reduce steps. It chains map steps to
follow a reduce step. At least, that is my understanding given the API
docs & examples I've read.

Is there a way to chain multiple reducer steps? I've got a job that
needs an M1-R1-R2. It currently has 2 phases: M1-R1 followed by M2-R2,
where M2 is an identity pass-through mapper. If there were a way to
chain 2 reduce steps the way ChainedMapper chains map steps, I could
make this into a one-pass job, eliminating the overhead of a second job
and all the unnecessary I/O.
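For what it's worth, the pass-through M2 does not have to be hand-written: the old (mapred) API ships a stock identity mapper, and in the new (mapreduce) API the base Mapper class already passes (key, value) through unchanged. A fragment, where conf2 / job2 stand for whatever the second job's JobConf / Job objects are called in the actual driver:

// Old (mapred) API: use the stock identity mapper for the M2 pass-through step.
conf2.setMapperClass(org.apache.hadoop.mapred.lib.IdentityMapper.class);

// New (mapreduce) API: the base Mapper class is already an identity pass-through.
job2.setMapperClass(org.apache.hadoop.mapreduce.Mapper.class);

Either way the second job still pays the job startup, shuffle and HDFS round-trip costs, which is the overhead in question.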

Thanks

Michael Clements
Solutions Architect
michael.clements@disney.com
206 664-4374 office
360 317 5051 mobile


