Subject: Re: Yarn HDFS and Yarn Exceptions when processing "larger" datasets.
From: Omkar Joshi <ojoshi@hortonworks.com>
To: user@hadoop.apache.org
Date: Mon, 1 Jul 2013 11:19:20 -0700

Hi,

I don't know your complete AM code or how your containers communicate with
each other, but a few things might help you debug:

- Where are you starting your RM? Is it really listening on port 8030, and
  are you sure there is no previously started RM still running there?
- In yarn-site.xml, can you try changing the RM scheduler address to
  something like "localhost:<free-port-but-not-default>" and also configure
  the maximum number of client threads for handling AM requests?
- Only your AM is expected to communicate with the RM over the AM-RM
  protocol. By any chance, are your containers talking to the RM directly
  on the AM-RM protocol?

The relevant yarn-site.xml properties:

  <property>
    <description>The address of the scheduler interface.</description>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>${yarn.resourcemanager.hostname}:8030</value>
  </property>

  <property>
    <description>Number of threads to handle scheduler interface.</description>
    <name>yarn.resourcemanager.scheduler.client.thread-count</name>
    <value>50</value>
  </property>
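
If it helps, a quick client-side check (just a sketch; the class name is
mine, not from your code, and it assumes your yarn-site.xml is on the
classpath of the JVM you run it from) that prints the scheduler address the
AM will actually dial:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.yarn.conf.YarnConfiguration;

  // Hypothetical helper: print the resolved AM-RM scheduler address.
  // If this prints 0.0.0.0:8030, the yarn-site.xml override is not being
  // picked up and the client falls back to the default, which matches the
  // "destination host is: 0.0.0.0:8030" in the stack trace quoted below.
  public class SchedulerAddressCheck {
      public static void main(String[] args) {
          // YarnConfiguration loads yarn-default.xml and yarn-site.xml from the classpath.
          Configuration conf = new YarnConfiguration();
          String addr = conf.get("yarn.resourcemanager.scheduler.address", "0.0.0.0:8030");
          System.out.println("yarn.resourcemanager.scheduler.address = " + addr);
      }
  }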

Thanks,
Omkar Joshi
Hortonworks Inc.


On Fri, Jun 28, 2013 at 5:35 AM, blah blah <tmp5330@gmail.com> wrote:

> Hi
>
> Sorry to reply so late; I don't have the data you requested (I have no
> time, my deadline is within 3 days). However, I have observed that this
> issue occurs not only for the "larger" dataset (6.8 MB) but for all
> datasets and all jobs in general. For the smaller dataset (1 MB) the AM
> does not throw the exception; only the containers throw exceptions (the
> same as in the previous e-mail). When these exceptions are thrown, my code
> (AM and containers) does not perform any operations on HDFS; it only does
> in-memory computation and communication. I have also observed that these
> exceptions occur at "random"; I couldn't find any pattern. I can execute a
> job successfully, then resubmit the job to repeat the experiment, and the
> exceptions occur (with no change to the source code, input dataset, or
> execution/input parameters).
>
> As for the high network usage: as I said, I don't have the data. But YARN
> is running on nodes that are exclusive to my experiments; no other
> software runs on them (only the OS and YARN). Besides, I don't think that
> 20 containers working on a 1 MB dataset (in total) can be called high
> network usage.
>
> regards
> tmp
>
>
> 2013/6/26 Devaraj k <devaraj.k@huawei.com>
>
>> Hi,
>>
>> Could you check the network usage in the cluster when this problem
>> occurs? It is probably happening due to high network usage.
>>
>> Thanks
>> Devaraj k
>>
>> From: blah blah [mailto:tmp5330@gmail.com]
>> Sent: 26 June 2013 05:39
>> To: user@hadoop.apache.org
>> Subject: Yarn HDFS and Yarn Exceptions when processing "larger" datasets.
>>
>> Hi All
>>
>> First, let me apologize for the poor thread title, but I have no idea how
>> to express the problem in one sentence.
>>
>> I have implemented a new Application Master using YARN. I am using an old
>> YARN development version: revision 1437315, from 2013-01-23
>> (3.0.0-SNAPSHOT). I cannot update to the current trunk version, as the
>> prototype deadline is soon and I don't have time to incorporate the YARN
>> API changes.
>>
>> Currently I run the experiments in pseudo-distributed mode and use Guava
>> version 14.0-rc1. I have a problem with YARN and HDFS exceptions for
>> "larger" datasets. My AM works fine and I can execute it without a
>> problem for a debug dataset (1 MB size). But when I increase the input
>> size to 6.8 MB, I get the following exceptions:
>>
>> AM_Exceptions_Stack
>>
>> Exception in thread "Thread-3" java.lang.reflect.UndeclaredThrowableException
>>     at org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.unwrapAndThrowException(YarnRemoteExceptionPBImpl.java:135)
>>     at org.apache.hadoop.yarn.api.impl.pb.client.AMRMProtocolPBClientImpl.allocate(AMRMProtocolPBClientImpl.java:77)
>>     at org.apache.hadoop.yarn.client.AMRMClientImpl.allocate(AMRMClientImpl.java:194)
>>     at org.tudelft.ludograph.app.AppMasterContainerRequester.sendContainerAskToRM(AppMasterContainerRequester.java:219)
>>     at org.tudelft.ludograph.app.AppMasterContainerRequester.run(AppMasterContainerRequester.java:315)
>>     at java.lang.Thread.run(Thread.java:662)
>> Caused by: com.google.protobuf.ServiceException: java.io.IOException:
>> Failed on local exception: java.io.IOException: Response is null.; Host
>> Details : local host is: "linux-ljc5.site/127.0.0.1"; destination host
>> is: "0.0.0.0":8030;
>>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:212)
>>     at $Proxy10.allocate(Unknown Source)
>>     at org.apache.hadoop.yarn.api.impl.pb.client.AMRMProtocolPBClientImpl.allocate(AMRMProtocolPBClientImpl.java:75)
>>     ... 4 more
>> Caused by: java.io.IOException: Failed on local exception:
>> java.io.IOException: Response is null.; Host Details : local host is:
>> "linux-ljc5.site/127.0.0.1"; destination host is: "0.0.0.0":8030;
>>     at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:760)
>>     at org.apache.hadoop.ipc.Client.call(Client.java:1240)
>>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
>>     ... 6 more
>> Caused by: java.io.IOException: Response is null.
>>     at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:950)
>>     at org.apache.hadoop.ipc.Client$Connection.run(Client.java:844)
>>
>> Container_Exception
>>
>> Exception in thread "org.apache.hadoop.hdfs.SocketCache@6da0d866"
>> java.lang.NoSuchMethodError:
>> com.google.common.collect.LinkedListMultimap.values()Ljava/util/List;
>>     at org.apache.hadoop.hdfs.SocketCache.clear(SocketCache.java:257)
>>     at org.apache.hadoop.hdfs.SocketCache.access$100(SocketCache.java:45)
>>     at org.apache.hadoop.hdfs.SocketCache$1.run(SocketCache.java:126)
>>     at java.lang.Thread.run(Thread.java:662)
>>
>> As I said, this problem does not occur for the 1 MB input. For the 6.8 MB
>> input nothing is changed except the input dataset. Now a little bit about
>> what I am doing, to give you the context of the problem. My AM starts N
>> containers (4 in debug) and each container reads its part of the input
>> data. When this is finished, the containers exchange parts of the input
>> (exchanging IDs of input structures, to provide a means of communication
>> between data structures). It is during this ID exchange that the
>> exceptions occur. I start a Netty server/client on each container and use
>> ports 12000-12099 as the means of communicating these IDs.
>>
>> Any help will be greatly appreciated. Sorry for any typos; if the
>> explanation is not clear, just ask for any details you are interested in.
>> It is currently after 2 AM, which I hope is a valid excuse.
>>
>> regards
>>
>> tmp
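
On the ID-exchange setup described in the quoted message, a minimal sketch
of the "first free port in 12000-12099" idea. It uses plain java.net sockets
rather than the Netty server/client actually used, and every name in it is
made up for illustration; it is not taken from the original code:

  import java.io.IOException;
  import java.net.ServerSocket;

  // Hypothetical illustration of the 12000-12099 port convention: bind the
  // first free port in the range, then advertise it to peers for the ID
  // exchange. The real code uses Netty; plain sockets are only used here to
  // keep the sketch self-contained.
  public class IdExchangePort {
      static final int PORT_RANGE_START = 12000;
      static final int PORT_RANGE_END = 12099;

      // Return a server socket bound to the first free port in the range;
      // fail if the whole range is already in use on this host.
      static ServerSocket bindFirstFreePort() throws IOException {
          for (int port = PORT_RANGE_START; port <= PORT_RANGE_END; port++) {
              try {
                  return new ServerSocket(port);
              } catch (IOException portTaken) {
                  // taken by another container on the same host; try the next one
              }
          }
          throw new IOException("no free port in " + PORT_RANGE_START + "-" + PORT_RANGE_END);
      }

      public static void main(String[] args) throws IOException {
          try (ServerSocket server = bindFirstFreePort()) {
              // A container would advertise server.getLocalPort() to its
              // peers before accepting ID-exchange connections.
              System.out.println("listening on port " + server.getLocalPort());
          }
      }
  }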