Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
User-Agent: Microsoft-MacOutlook/0.0.0.150911
Date: Wed, 23 Sep 2015 23:39:10 +0530
Subject: Re: node remains unused after reboot
From: Varun Vasudev <vvasudev@apache.org>
To: <user@hadoop.apache.org>
Message-ID: <0F708102-A671-44ED-BEF7-C5BBE796572B@apache.org>
Thread-Topic: node remains unused after reboot
References: <36A18766-3FBD-464B-908E-2965F7768607@gmail.com>
 <AD354F56741A1B47882A625909A59C692BE312DA@SZXEML505-MBX.china.huawei.com>
 <568D9948-B210-4014-82A5-12240560CBD8@gmail.com>
In-Reply-To: <568D9948-B210-4014-82A5-12240560CBD8@gmail.com>
Mime-version: 1.0
Content-type: text/plain;
	charset="UTF-8"
Content-transfer-encoding: quoted-printable

Hi Dmitry,

Did you check the MR AM logs to see if the node was blacklisted for too many container failures?
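
One rough way to do that check, as a sketch only: scan a saved copy of the AM log for lines that mention both the host name and the word "blacklist". The function names, the log path, and the host name below are placeholders, and the exact wording of the AM's message is not assumed; the match is a plain case-insensitive substring search.

```python
import re
from typing import Iterable, List

# Case-insensitive match on the word stem only; the exact log message
# format is deliberately not assumed.
_BLACKLIST = re.compile(r"blacklist", re.IGNORECASE)


def find_blacklist_lines(lines: Iterable[str], host: str) -> List[str]:
    """Return the lines that mention both `host` and 'blacklist'."""
    host_lower = host.lower()
    return [
        line.rstrip("\n")
        for line in lines
        if _BLACKLIST.search(line) and host_lower in line.lower()
    ]


def scan_log_file(path: str, host: str) -> List[str]:
    """Apply find_blacklist_lines to a log file on disk (hypothetical path)."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_blacklist_lines(handle, host)
```

For example, `scan_log_file("am-syslog.txt", "node42.example.com")` (both arguments hypothetical) returns the matching lines; an empty list suggests the AM log does not mention blacklisting for that host.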

-Varun


On 9/23/15, 12:26 PM, "Dmitry Sivachenko" <trtrmitya@gmail.com> wrote:

>
>> On 23 Sep 2015, at 7:02, Naganarasimha G R (Naga) <garlanaganarasimha@huawei.com> wrote:
>>
>> Hi Dmitry,
>> This seems to be an interesting case; I would like some more clarification in this regard:
>> 1. How many NMs are there? Is it a heterogeneous cluster, or do all the nodes have the same resource capacity? With 3000 cores and the same config on every node, I would expect around 100 nodes; am I correct?
>
>
>I have 1 NN (and 1 SNN).
>To be precise, I have 113 32-core machines assigned to run jobs (113*32=3616 total VCores)
>
>
>> 2. How many applications are running and how many have finished (basically, those available in the RM)? By 35000, do you mean finished and running applications?
>
>There was 1 application running at that time (with 35000 map tasks)
>
>
>> 3. Do tasks get assigned after some time? Also, is it only this host that gets no containers assigned, or do other hosts also get none?
>
>
>This machine was excluded from running tasks for that job. It got tasks assigned after almost 1.5 hours, when the first job (during which the machine failed) had finished and the next job was started; see the timestamps:
>
>
>
>2015-09-23 01:06:24,656 INFO  [main] nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:registerWithRM(311)) - Notifying ContainerManager to unblock new container-requests
>2015-09-23 02:29:33,301 INFO  [Socket Reader #1 for port 10007] ipc.Server (Server.java:saslProcess(1316)) - Auth successful for appattempt_1441808341485_1975_000001 (auth:SIMPLE)
>
>
>The previous job (during which that node rebooted) did not run any more tasks on this host.
>
>
>>
>> I suspect this issue might be similar to YARN-3990, hence the above questions. You can also check the RM logs and let us know whether you see logs similar to the ones below:
>> 2015-07-29 19:39:03,416 | INFO  | AsyncDispatcher event handler | Size of event-queue is 14000 | AsyncDispatcher.java:235
>> 2015-07-29 19:39:03,417 | INFO  | AsyncDispatcher event handler | Size of event-queue is 15000 | AsyncDispatcher.java:235
>
>
>There were 2 of these:
>2015-09-23 00:54:39,623 INFO  [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:handle(235)) - Size of event-queue is 1000
>2015-09-23 01:06:24,623 INFO  [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:handle(235)) - Size of event-queue is 1000
>
>
>What do these mean?
>
>
>>
>>
>> Regards,
>> + Naga
>>
>>
>> From: Dmitry Sivachenko [trtrmitya@gmail.com]
>> Sent: Wednesday, September 23, 2015 03:57
>> To: user@hadoop.apache.org
>> Subject: node remains unused after reboot
>>
>> Hello!
>>
>> I am using hadoop-2.7.1. I have a large map job running (total cores available on the cluster about 3000, total tasks 35000).
>> In the middle of this process one server reboots.
>>
>> After reboot, nodemanager starts successfully and registers with resource manager:
>> 2015-09-23 01:06:24,656 INFO  [main] nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:registerWithRM(311)) - Notifying ContainerManager to unblock new container-requests
>>
>> In YARN web-interface I see this host as active, but VCores used remains zero (see screenshot).
>> But the map job mentioned is still running and has about 12000 pending tasks.
>>
>> Why does this host not receive tasks to run?
>>
>> PS: I recently upgraded from 2.4.1 and I did not notice such a problem with 2.4.1: new tasks were spawning immediately after reboot.
>>
>> Thanks!
>>
>>
>>
>>
>> <Screen Shot 2015-09-23 at 1.22.10.png>
>