Subject: Re: chaining (the output of) jobs/ reducers
From: Adrian CAPDEFIER
To: user@hadoop.apache.org
Date: Tue, 17 Sep 2013 14:23:40 +0100
In-Reply-To: <8ED5ED25-2DE0-43FC-A1F8-F37F79266AEA@apache.org>

I've just seen your email, Vinod. This is the behaviour that I'd expect, and it is similar to other data integration tools; I will keep an eye out for it as a long-term option.

On Fri, Sep 13, 2013 at 5:26 AM, Vinod Kumar Vavilapalli wrote:

> Other than the short-term solutions that others have proposed, Apache Tez
> solves this exact problem. It can run M-M-R-R-R chains, multi-way mappers
> and reducers, and your own custom processors - all without persisting the
> intermediate outputs to HDFS.
>
> It works on top of YARN, though the first release of Tez is yet to happen.
>
> You can learn more about it here: http://tez.incubator.apache.org/
>
> HTH,
> +Vinod
>
> On Sep 12, 2013, at 6:36 AM, Adrian CAPDEFIER wrote:
>
> Howdy,
>
> My application requires 2 distinct processing steps (reducers) to be
> performed on the input data. The first operation changes the key values,
> and records that had different keys in step 1 can end up having the same
> key in step 2.
>
> The heavy lifting of the operation is in step 1; step 2 only combines
> records whose keys were changed.
>
> In short, the overview is:
> Sequential file -> Step 1 -> Step 2 -> Output.
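The two-step flow above can be simulated in plain Java. This is a single-process sketch, not MapReduce code; the lower-casing `rekey` transform is a hypothetical stand-in for whatever key change step 1 really performs:

```java
import java.util.*;

public class TwoStepChain {
    // Step 1's key transform: a hypothetical stand-in that lower-cases key1,
    // so records with different step-1 keys can collide on the same step-2 key.
    static String rekey(String key1) {
        return key1.toLowerCase();
    }

    // Step 2: regroup step 1's output by the new key and combine the values
    // (here, by summing them) for records whose keys now collide.
    static Map<String, Integer> step2(Map<String, Integer> step1Output) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> e : step1Output.entrySet()) {
            combined.merge(rekey(e.getKey()), e.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        // (key1, value1) records, as if emitted by step 1 from a sequential file
        Map<String, Integer> step1Output = new HashMap<>();
        step1Output.put("Alpha", 1);
        step1Output.put("ALPHA", 2);  // distinct key1, same key2 as "Alpha"
        step1Output.put("Beta", 3);

        System.out.println(step2(step1Output));  // {alpha=3, beta=3}
    }
}
```

Because `rekey` maps the distinct step-1 keys `"Alpha"` and `"ALPHA"` to the same step-2 key, step 2 combines their values — exactly the key-collision behaviour described above.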
>
> To implement this in Hadoop, it seems that I need to create a separate job
> for each step.
>
> Now I assumed there would be some sort of job management under Hadoop to
> link Jobs 1 and 2, but the only thing I could find was related to job
> scheduling, and nothing on how to synchronize the input/output of the
> linked jobs.
>
> The only crude solution that I can think of is to use a temporary file
> under HDFS, but even so I'm not sure if this will work.
>
> The overview of the process would be:
> Sequential Input (lines) => Job A[Mapper (key1, value1) => ChainReducer
> (key2, value2)] => Temporary file => Job B[Mapper (key2, value2) => Reducer
> (key2, value3)] => output.
>
> Is there a better way to pass the output from Job A as input to Job B
> (e.g. using network streams or some built-in Java classes that don't do
> disk I/O)?
>
> The temporary file solution will work in a single-node configuration, but
> I'm not sure about an MPP config.
>
> Let's say Job A runs on nodes 0 and 1 and Job B runs on nodes 2 and 3, or
> both jobs run on all 4 nodes - will HDFS be able to redistribute the
> records between nodes automagically, or does this need to be coded
> somehow?
>
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity
> to which it is addressed and may contain information that is confidential,
> privileged and exempt from disclosure under applicable law. If the reader
> of this message is not the intended recipient, you are hereby notified that
> any printing, copying, dissemination, distribution, disclosure or
> forwarding of this communication is strictly prohibited. If you have
> received this communication in error, please contact the sender immediately
> and delete it from your system. Thank You.
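The temporary-file handoff between Job A and Job B described in the quoted question can be sketched as follows. This is a single-process simulation using a local temp file in place of an HDFS path — the `jobA`/`jobB` names are illustrative stand-ins; on a real cluster, Job A's driver would block on `waitForCompletion(true)` before Job B's input path is pointed at Job A's output directory:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class ChainViaTempFile {
    // "Job A": writes (key2, value2) pairs to a temporary file, one
    // tab-separated pair per line, standing in for Job A's output directory.
    static Path jobA(List<String> keyValueLines) throws IOException {
        Path tmp = Files.createTempFile("jobA-out", ".txt");
        Files.write(tmp, keyValueLines);
        return tmp;
    }

    // "Job B": reads Job A's output back, groups by key, and sums the values,
    // standing in for Job B's mapper/reducer pass over the intermediate file.
    static Map<String, Integer> jobB(Path jobAOutput) throws IOException {
        Map<String, Integer> result = new TreeMap<>();
        for (String line : Files.readAllLines(jobAOutput)) {
            String[] kv = line.split("\t");
            result.merge(kv[0], Integer.parseInt(kv[1]), Integer::sum);
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = jobA(List.of("a\t1", "a\t2", "b\t5"));
        try {
            System.out.println(jobB(tmp));  // {a=3, b=5}
        } finally {
            Files.delete(tmp);              // clean up the intermediate file
        }
    }
}
```

The pattern is the one the question arrives at: the intermediate result is persisted once, then re-read and regrouped by the second job; the grouping-by-key in `jobB` is what HDFS plus the shuffle phase would do across nodes in a real two-job chain.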