Date: Thu, 24 Sep 2015 00:38:19 +0530
Subject: Re: node remains unused after reboot
From: Naganarasimha Garla
To: user@hadoop.apache.org

Sorry for the late reply; I thought of providing you some search strings for blacklisting, hence the slight delay.

As Varun mentioned, it looks more like an app-blacklisting case. mapreduce.job.maxtaskfailures.per.tracker is 3 by default, so in the scenario you describe the likely explanation is that the node is getting blacklisted by the job. You can search for INFO logs containing the string "*Blacklisted host <host>*" from the RMContainerRequestor class.
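The search suggested above can be done with a plain grep over the MR ApplicationMaster log. The sketch below is illustrative: the sample log line's layout and the /tmp path are assumptions for the demo; only the "Blacklisted host" marker string (from RMContainerRequestor) comes from this thread.

```shell
# Write a sample AM log line to a scratch file (line layout is a guess;
# the "Blacklisted host" marker is the string to search for), then count
# how many blacklisting events appear in the log.
printf '2015-09-23 01:10:02,114 INFO [eventHandlingThread] RMContainerRequestor: Blacklisted host host42.example.com\n' > /tmp/sample-am-syslog
grep -c "Blacklisted host" /tmp/sample-am-syslog
```

On a real cluster you would point the grep at the AM container's syslog under the NodeManager's userlogs directory for the application. If the count is nonzero, raising mapreduce.job.maxtaskfailures.per.tracker for the job is one way to keep a node usable after transient failures.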
*What do these mean?* As per the defect in *YARN-3990*, if many events are clogged in the queue (seen in the logs as *Size of event-queue is 14000*), there is a possibility that events get delayed and hence assignment is delayed too; but as per the descriptions you shared, it does not seem to be this case. But how many finished applications were there? More nodes and more apps (finished/running) can cause this.

+ Naga

On Wed, Sep 23, 2015 at 11:39 PM, Varun Vasudev <vvasudev@apache.org> wrote:
> Hi Dmitry,
>
> Did you check the MR AM logs to see if the node was blacklisted for too
> many container failures?
>
> -Varun
>
> On 9/23/15, 12:26 PM, "Dmitry Sivachenko" <trtrmitya@gmail.com> wrote:
>
> >> On 23 Sep 2015, at 7:02, Naganarasimha G R (Naga)
> >> <garlanaganarasimha@huawei.com> wrote:
> >>
> >> Hi Dmitry,
> >> Seems to be an interesting case; would like some more clarifications in
> >> this regard:
> >> 1. How many NMs? Is it a heterogeneous cluster, or do all the nodes have
> >> the same resource capacity? By 3000 cores, if same config, then I am
> >> expecting around 100 nodes; am I correct?
> >
> > I have 1 NN (and 1 SNN).
> > To be precise, I have 113 32-core machines assigned to run jobs
> > (113*32 = 3616 total VCores).
> >
> >> 2. How many applications are running and how many have got finished
> >> (basically available in RM)? By 35000, do you mean finished and running
> >> applications?
> >
> > There was 1 application running at that time (with 35000 map tasks).
> >
> >> 3. Whether, after some time, tasks are getting assigned? Also, is it
> >> only this host not getting assigned, or does no other host get any
> >> containers assigned either?
> >
> > This machine was excluded from running tasks for that job. It got tasks
> > assigned after almost 1.5 hours, when the first job (during which the
> > machine failed) was finished and the next job was started; see timestamps:
> >
> > 2015-09-23 01:06:24,656 INFO  [main] nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:registerWithRM(311)) - Notifying ContainerManager to unblock new container-requests
> > 2015-09-23 02:29:33,301 INFO  [Socket Reader #1 for port 10007] ipc.Server (Server.java:saslProcess(1316)) - Auth successful for appattempt_1441808341485_1975_000001 (auth:SIMPLE)
> >
> > The previous job (during which that node rebooted) did not run more tasks
> > on this host.
> >
> >> I suspect this issue might be similar to YARN-3990, hence the above
> >> questions. Further, you can check the RM logs and inform whether you see
> >> some logs similar to the below:
> >> 2015-07-29 19:39:03,416 | INFO  | AsyncDispatcher event handler | Size of event-queue is 14000 | AsyncDispatcher.java:235
> >> 2015-07-29 19:39:03,417 | INFO  | AsyncDispatcher event handler | Size of event-queue is 15000 | AsyncDispatcher.java:235
> >
> > There were 2 of these:
> > 2015-09-23 00:54:39,623 INFO  [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:handle(235)) - Size of event-queue is 1000
> > 2015-09-23 01:06:24,623 INFO  [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:handle(235)) - Size of event-queue is 1000
> >
> > What do these mean?
> >
> >> Regards,
> >> + Naga
> >>
> >> From: Dmitry Sivachenko [trtrmitya@gmail.com]
> >> Sent: Wednesday, September 23, 2015 03:57
> >> To: user@hadoop.apache.org
> >> Subject: node remains unused after reboot
> >>
> >> Hello!
> >>
> >> I am using hadoop-2.7.1. I have a large map job running (total cores
> >> available on the cluster: about 3000; total tasks: 35000).
> >> In the middle of this process one server reboots.
> >>
> >> After reboot, the nodemanager starts successfully and registers with the
> >> resource manager:
> >> 2015-09-23 01:06:24,656 INFO  [main] nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:registerWithRM(311)) - Notifying ContainerManager to unblock new container-requests
> >>
> >> In the YARN web interface I see this host as active, but VCores used
> >> remains zero (see screenshot).
> >> But the map job mentioned is still running and has about 12000 pending
> >> tasks.
> >>
> >> Why does this host not receive tasks to run?
> >>
> >> PS: I recently upgraded from 2.4.1, and I did not notice such a problem
> >> with 2.4.1: new tasks were spawning immediately after reboot.
> >>
> >> Thanks!