Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
Received-SPF: pass (nike.apache.org: domain of teddyyyy123@gmail.com
 designates 209.85.192.179 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAJo9sUbs2VW0ggS608wU41CBtBRqO6djDm45rKU4LWNJeiJfTQ@mail.gmail.com>
References: 
 <CAAnh3_9yTNGMOQ0sF3qi87kMh=t44qzn9dx=m3MZqsosSv4osw@mail.gmail.com>
 <CAJo9sUbs2VW0ggS608wU41CBtBRqO6djDm45rKU4LWNJeiJfTQ@mail.gmail.com>
From: Yang <teddyyyy123@gmail.com>
Date: Thu, 30 Oct 2014 15:04:31 -0700
Message-ID: 
 <CAAnh3_-nOzOBTzLuvEcWRPy+M+pSyetizRe=68p-18O-FtZwmQ@mail.gmail.com>
Subject: Re: run arbitrary job (non-MR) on YARN ?
To: user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=001a11c2c220d3211c0506ab1133

--001a11c2c220d3211c0506ab1133
Content-Type: text/plain; charset=UTF-8

thanks!

On Wed, Oct 29, 2014 at 2:38 PM, Kevin <kevin.macksamie@gmail.com> wrote:

> You can accomplish this by using the DistributedShell application that
> comes with YARN.
>
> If you copy all your archives to HDFS, then inside your shell script you
> could copy those archives to your YARN container and then execute whatever
> you want, provided all the other system dependencies exist in the container
> (correct Java version, Python, C++ libraries, etc.)
>
> For example,
>
> In myscript.sh I wrote the following:
>
> #!/usr/bin/env bash
> echo "This is my script running!"
> echo "Present working directory:"
> pwd
> echo "Current directory listing: (nothing exciting yet)"
> ls
> echo "Copying file from HDFS to container"
> hadoop fs -get /path/to/some/data/on/hdfs .
> echo "Current directory listing: (file should not be here)"
> ls
> echo "Cat ExecScript.sh (this is the script created by the
> DistributedShell application)"
> cat ExecScript.sh
>
> Run the DistributedShell application with the hadoop (or yarn) command:
>
> hadoop org.apache.hadoop.yarn.applications.distributedshell.Client -jar
> /usr/lib/hadoop-yarn/hadoop-yarn-applications-distributedshell-2.3.0-cdh5.1.3.jar
> -num_containers 1 -shell_script myscript.sh
>
> If you have the YARN log aggregation property set, then you can pipe the
> container's logs to your client console using the yarn command:
>
> yarn logs -applicationId application_1414160538995_0035
>
> (replace the application id with yours)
>
> Here is a quick reference that should help get you going:
>
> http://books.google.com/books?id=heoXAwAAQBAJ&pg=PA227&lpg=PA227&dq=hadoop+yarn+distributed+shell+application&source=bl&ots=psGuJYlY1Y&sig=khp3b3hgzsZLZWFfz7GOe2yhgyY&hl=en&sa=X&ei=0U5RVKzDLeTK8gGgoYGoDQ&ved=0CFcQ6AEwCA#v=onepage&q&f=false
>
> Hopefully this helps,
> Kevin
>
> On Mon Oct 27 2014 at 2:21:18 AM Yang <teddyyyy123@gmail.com> wrote:
>
>> I happened to run into this interesting scenario:
>>
>> I had some mahout seq2sparse jobs, originally i run them in parallel
>> using the distributed mode. but because the input files are so small,
>> running them locally actually is much faster. so I truned them to local
>> mode.
>>
>> but I run 10 of these jobs in parallel, so when 10 mahout jobs are run
>> together, everyone became very slow.
>>
>> is there an existing code that takes a desired shell script, and possibly
>> some archive files (could contain the jar file, or C++ --generated
>> executable code). I understand that I could use yarn API to code such a
>> thing, but it would be nice if I could just take it and run in shell..
>>
>> Thanks
>> Yang
>>
>

--001a11c2c220d3211c0506ab1133
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr">thanks!</div><div class=3D"gmail_extra"><br><div class=3D"=
gmail_quote">On Wed, Oct 29, 2014 at 2:38 PM, Kevin <span dir=3D"ltr">&lt;<=
a href=3D"mailto:kevin.macksamie@gmail.com" target=3D"_blank">kevin.macksam=
ie@gmail.com</a>&gt;</span> wrote:<br><blockquote class=3D"gmail_quote" sty=
le=3D"margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div>Y=
ou can accomplish this by using the DistributedShell application that comes=
 with YARN.</div><div><br></div><div>If you copy all your archives to HDFS,=
 then inside your shell script you could copy those archives to your YARN c=
ontainer and then execute whatever you want, provided all the other system =
dependencies exist in the container (correct Java version, Python, C++ libr=
aries, etc.)</div><div><br></div><div>For example,</div><div><br></div><div=
>In myscript.sh I wrote the following:</div><div><br></div><div><div>#!/usr=
/bin/env bash</div><div>echo &quot;This is my script running!&quot;</div><d=
iv>echo &quot;Present working directory:&quot;</div><div>pwd</div><div>echo=
 &quot;Current directory listing: (nothing exciting yet)&quot;</div><div>ls=
</div><div>echo &quot;Copying file from HDFS to container&quot;</div><div>h=
adoop fs -get /path/to/some/data/on/hdfs .</div><div>echo &quot;Current dir=
ectory listing: (file should not be here)&quot;</div><div>ls</div><div>echo=
 &quot;Cat ExecScript.sh (this is the script created by the DistributedShel=
l application)&quot;</div><div>cat ExecScript.sh</div></div><div><br></div>=
<div>Run the DistributedShell application with the hadoop (or yarn) command=
:</div><div><br></div><div>hadoop org.apache.hadoop.yarn.applications.distr=
ibutedshell.Client -jar /usr/lib/hadoop-yarn/hadoop-yarn-applications-distr=
ibutedshell-2.3.0-cdh5.1.3.jar -num_containers 1 -shell_script myscript.sh<=
br></div><div><br></div><div>If you have the YARN log aggregation property =
set, then you can pipe the container&#39;s logs to your client console usin=
g the yarn command:</div><div><br></div><div>yarn logs -applicationId appli=
cation_1414160538995_0035<br></div><div><br></div>(replace the application =
id with yours)<div><div><br></div><div><div>Here is a quick reference that =
should help get you going:</div><div><a href=3D"http://books.google.com/boo=
ks?id=3DheoXAwAAQBAJ&amp;pg=3DPA227&amp;lpg=3DPA227&amp;dq=3Dhadoop+yarn+di=
stributed+shell+application&amp;source=3Dbl&amp;ots=3DpsGuJYlY1Y&amp;sig=3D=
khp3b3hgzsZLZWFfz7GOe2yhgyY&amp;hl=3Den&amp;sa=3DX&amp;ei=3D0U5RVKzDLeTK8gG=
goYGoDQ&amp;ved=3D0CFcQ6AEwCA#v=3Donepage&amp;q&amp;f=3Dfalse" target=3D"_b=
lank">http://books.google.com/books?id=3DheoXAwAAQBAJ&amp;pg=3DPA227&amp;lp=
g=3DPA227&amp;dq=3Dhadoop+yarn+distributed+shell+application&amp;source=3Db=
l&amp;ots=3DpsGuJYlY1Y&amp;sig=3Dkhp3b3hgzsZLZWFfz7GOe2yhgyY&amp;hl=3Den&am=
p;sa=3DX&amp;ei=3D0U5RVKzDLeTK8gGgoYGoDQ&amp;ved=3D0CFcQ6AEwCA#v=3Donepage&=
amp;q&amp;f=3Dfalse</a></div><div><br></div><div>Hopefully this helps,</div=
><div>Kevin</div><div><div class=3D"h5"><div><br><div class=3D"gmail_quote"=
>On Mon Oct 27 2014 at 2:21:18 AM Yang &lt;<a href=3D"mailto:teddyyyy123@gm=
ail.com" target=3D"_blank">teddyyyy123@gmail.com</a>&gt; wrote:<br><blockqu=
ote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1px #ccc s=
olid;padding-left:1ex"><div dir=3D"ltr">I happened to run into this interes=
ting scenario:<div><br></div><div>I had some mahout seq2sparse jobs, origin=
ally i run them in parallel using the distributed mode. but because the inp=
ut files are so small, running them locally actually is much faster. so I t=
runed them to local mode.</div><div><br></div><div>but I run 10 of these jo=
bs in parallel, so when 10 mahout jobs are run together, everyone became ve=
ry slow.</div><div><br></div><div>is there an existing code that takes a de=
sired shell script, and possibly some archive files (could contain the jar =
file, or C++ --generated executable code). I understand that I could use ya=
rn API to code such a thing, but it would be nice if I could just take it a=
nd run in shell..</div><div><br></div><div>Thanks</div></div><div dir=3D"lt=
r"><div>Yang</div></div></blockquote></div></div></div></div></div></div>
</blockquote></div><br></div>

--001a11c2c220d3211c0506ab1133--