From: Ted Dunning
Date: Mon, 4 Mar 2013 14:43:06 -0500
Subject: Re: Accumulo and Mapreduce
To: user@hadoop.apache.org

Chaining the jobs is a fantastically inefficient solution. If you use Pig
or Cascading, the optimizer will glue all of your map functions into a
single mapper. The result is something like:

    (mapper1 -> mapper2 -> mapper3) => reducer

Here the parentheses indicate that all of the map functions are executed as
a single function formed by composing mapper1, mapper2, and mapper3.
Writing multiple jobs to do this forces *lots* of unnecessary traffic to
your persistent store and lots of unnecessary synchronization.

You can do this optimization by hand, but using a higher level language is
often better for maintenance.
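For the by-hand route, Hadoop ships ChainMapper/ChainReducer, which do
exactly this composition inside one job. Below is a minimal sketch, assuming
the Hadoop 2 API in org.apache.hadoop.mapreduce.lib.chain; on Hadoop 1 the
equivalent lives in org.apache.hadoop.mapred.lib and takes a JobConf. The
three mappers, the reducer, and the Text-based key/value types here are
hypothetical stand-ins for the Mapper1/Mapper2/Mapper3/Reducer1 discussed
in the quoted thread below.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
    import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class ComposedJob {

      // Hypothetical first stage: keys each line by its byte offset.
      public static class Mapper1 extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
          ctx.write(new Text(key.toString()), value);
        }
      }

      // Hypothetical middle stages; identity transforms for the sketch.
      public static class Mapper2 extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text key, Text value, Context ctx)
            throws IOException, InterruptedException {
          ctx.write(key, value);
        }
      }

      public static class Mapper3 extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text key, Text value, Context ctx)
            throws IOException, InterruptedException {
          ctx.write(key, value);
        }
      }

      // Hypothetical reducer: passes grouped values straight through.
      public static class Reducer1 extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
          for (Text v : values) {
            ctx.write(key, v);
          }
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "(m1 -> m2 -> m3) => reducer");
        job.setJarByClass(ComposedJob.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        TextInputFormat.addInputPath(job, new Path(args[0]));
        TextOutputFormat.setOutputPath(job, new Path(args[1]));

        // All three map functions run back to back inside a single map
        // task, so nothing is written to the persistent store between them.
        ChainMapper.addMapper(job, Mapper1.class, LongWritable.class, Text.class,
            Text.class, Text.class, new Configuration(false));
        ChainMapper.addMapper(job, Mapper2.class, Text.class, Text.class,
            Text.class, Text.class, new Configuration(false));
        ChainMapper.addMapper(job, Mapper3.class, Text.class, Text.class,
            Text.class, Text.class, new Configuration(false));

        // One reducer consumes the composed map output.
        ChainReducer.setReducer(job, Reducer1.class, Text.class, Text.class,
            Text.class, Text.class, new Configuration(false));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Pig and Cascading perform the same composition automatically during query
planning, which is the maintenance argument above.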
On Mon, Mar 4, 2013 at 1:52 PM, Russell Jurney <russell.jurney@gmail.com> wrote:

> You can chain MR jobs with Oozie, but I would suggest using Cascading,
> Pig, or Hive. You can do this in a couple lines of code, I suspect. Two
> MapReduce jobs should not pose any kind of challenge with the right
> tools.
>
>
> On Monday, March 4, 2013, Sandy Ryza wrote:
>
>> Hi Aji,
>>
>> Oozie is a mature project for managing MapReduce workflows.
>> http://oozie.apache.org/
>>
>> -Sandy
>>
>>
>> On Mon, Mar 4, 2013 at 8:17 AM, Justin Woody <justin.woody@gmail.com> wrote:
>>
>>> Aji,
>>>
>>> Why don't you just chain the jobs together?
>>> http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining
>>>
>>> Justin
>>>
>>> On Mon, Mar 4, 2013 at 11:11 AM, Aji Janis <aji1705@gmail.com> wrote:
>>> > Russell, thanks for the link.
>>> >
>>> > I am interested in finding a solution (if one is out there) where
>>> > Mapper1 outputs a custom object and Mapper2 can use that as input.
>>> > One way to do this, obviously, is by writing to Accumulo in my case.
>>> > But is there another solution for this:
>>> >
>>> > List<MyObject> ----> Input to Job
>>> >
>>> > MyObject ---> Input to Mapper1 (process MyObject) ----> Output
>>> > <MyObjectId, MyObject>
>>> >
>>> > <MyObjectId, MyObject> are Input to Mapper2 ... and so on
>>> >
>>> > Ideas?
>>> >
>>> > On Mon, Mar 4, 2013 at 10:00 AM, Russell Jurney
>>> > <russell.jurney@gmail.com> wrote:
>>> >>
>>> >> http://svn.apache.org/repos/asf/accumulo/contrib/pig/trunk/src/main/java/org/apache/accumulo/pig/AccumuloStorage.java
>>> >>
>>> >> AccumuloStorage for Pig comes with Accumulo. The easiest way would be
>>> >> to try it.
>>> >>
>>> >> Russell Jurney http://datasyndrome.com
>>> >>
>>> >> On Mar 4, 2013, at 5:30 AM, Aji Janis <aji1705@gmail.com> wrote:
>>> >>
>>> >> Hello,
>>> >>
>>> >> I have an MR job design with a flow like this: Mapper1 -> Mapper2 ->
>>> >> Mapper3 -> Reducer1. Mapper1's input is an Accumulo table. M1's
>>> >> output goes to M2, and so on. Finally the Reducer writes output to
>>> >> Accumulo.
>>> >>
>>> >> Questions:
>>> >>
>>> >> 1) Has anyone tried something like this before? Are there any
>>> >> workflow control APIs (in or outside of Hadoop) that can help me set
>>> >> up the job like this, or am I limited to using Quartz for this?
>>> >> 2) If both M2 and M3 needed to write some data to the same two tables
>>> >> in Accumulo, is it possible to do so? Are there any good Accumulo
>>> >> MapReduce jobs you can point me to? Blogs/pages that I can use for
>>> >> reference (starting point/best practices)?
>>> >>
>>> >> Thank you in advance for any suggestions!
>>> >>
>>> >> Aji
>
> --
> Russell Jurney  twitter.com/rjurney  russell.jurney@gmail.com  datasyndrome.com
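A note on the custom-object question in the quoted thread: with chained
mappers (or any single job), the <MyObjectId, MyObject> pair passed from
Mapper1 to Mapper2 only has to be Hadoop-serializable; it never needs to be
persisted to Accumulo. Here is a minimal sketch of such a class implementing
Hadoop's Writable interface. The id and payload fields are hypothetical,
since the thread does not say what MyObject contains.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;

    // Hypothetical MyObject from the thread; the fields are assumptions.
    public class MyObject implements Writable {
      private final LongWritable id = new LongWritable();
      private final Text payload = new Text();

      public void set(long newId, String newPayload) {
        id.set(newId);
        payload.set(newPayload);
      }

      public long getId() { return id.get(); }
      public String getPayload() { return payload.toString(); }

      @Override
      public void write(DataOutput out) throws IOException {
        // Called when the object is handed between chained mappers or
        // shuffled to the reducer; no trip through the persistent store.
        id.write(out);
        payload.write(out);
      }

      @Override
      public void readFields(DataInput in) throws IOException {
        id.readFields(in);
        payload.readFields(in);
      }
    }

Mapper1 would then extend Mapper<K, V, LongWritable, MyObject> (the
<MyObjectId, MyObject> pair from the thread, with LongWritable standing in
for MyObjectId) and be registered via ChainMapper.addMapper(...,
LongWritable.class, MyObject.class, ...), so Mapper2 receives the pair
directly in memory.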