hadoop-user mailing list archives

From Varun Vasudev <vvasu...@apache.org>
Subject Re: node remains unused after reboot
Date Wed, 23 Sep 2015 18:09:10 GMT
Hi Dmitry,

Did you check the MR AM logs to see if the node was blacklisted for too many container failures?
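
If it helps, here is a minimal sketch of that check. It assumes the AM log has been saved to a local text file and that the relevant lines contain the word "blacklist" in some form. The exact wording of the AM's messages is not taken from this thread, and the function name `find_blacklist_lines` and its `host` argument are purely illustrative, not part of Hadoop.

```python
import re
from typing import Iterable, List

# Assumption: any line mentioning "blacklist" (case-insensitive) is of
# interest. The real log message format may differ between versions.
BLACKLIST_RE = re.compile(r"blacklist", re.IGNORECASE)


def find_blacklist_lines(lines: Iterable[str], host: str = "") -> List[str]:
    """Return log lines that mention blacklisting.

    If `host` is non-empty, keep only lines that also contain that hostname.
    """
    hits = []
    for line in lines:
        if BLACKLIST_RE.search(line) and (not host or host in line):
            hits.append(line.rstrip("\n"))
    return hits


# Example usage (the file name and hostname are placeholders):
#   with open("am-syslog.txt", errors="replace") as f:
#       for hit in find_blacklist_lines(f, host="rebooted-node"):
#           print(hit)
```

If the rebooted host shows up in the output, that would be consistent with the AM having stopped requesting containers on it for the rest of that job.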

-Varun



On 9/23/15, 12:26 PM, "Dmitry Sivachenko" <trtrmitya@gmail.com> wrote:

>
>> On Sep 23, 2015, at 7:02, Naganarasimha G R (Naga) <garlanaganarasimha@huawei.com> wrote:
>> 
>> Hi Dmitry,
>> This seems to be an interesting case; I would like a few more clarifications:
>> 1. How many NMs are there? Is it a heterogeneous cluster, or do all the nodes have the same resource capacity? With 3000 cores and identical configuration I would expect around 100 nodes; am I correct?
>
>
>I have 1 NN (and 1 SNN).
>To be precise, I have 113 32-core machines assigned to run jobs (113*32=3616 total VCores)
>
>
>> 2. How many applications are running and how many have finished (basically, are available in the RM)? By 35000, do you mean finished and running applications?
>
>There was 1 application running at that time (with 35000 map tasks).
>
>
>> 3. Do tasks get assigned after some time? Also, is it only this host that is not getting containers assigned, or are other hosts not getting any either?
>
>
>This machine was excluded from running tasks for that job. It got tasks assigned after almost 1.5 hours, when the first job (during which the machine failed) had finished and the next job was started; see the timestamps:
>
>
>
>2015-09-23 01:06:24,656 INFO  [main] nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:registerWithRM(311)) - Notifying ContainerManager to unblock new container-requests
>2015-09-23 02:29:33,301 INFO  [Socket Reader #1 for port 10007] ipc.Server (Server.java:saslProcess(1316)) - Auth successful for appattempt_1441808341485_1975_000001 (auth:SIMPLE)
>
>
>The previous job (during which that node rebooted) did not run any more tasks on this host.
>
>
>> 
>> I suspect this issue might be similar to YARN-3990, hence the above questions. Further, you can check the RM logs and let us know whether you see logs similar to the ones below:
>> 2015-07-29 19:39:03,416 | INFO  | AsyncDispatcher event handler | Size of event-queue is 14000 | AsyncDispatcher.java:235
>> 2015-07-29 19:39:03,417 | INFO  | AsyncDispatcher event handler | Size of event-queue is 15000 | AsyncDispatcher.java:235
>
>
>There were 2 of these:
>2015-09-23 00:54:39,623 INFO  [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:handle(235)) - Size of event-queue is 1000
>2015-09-23 01:06:24,623 INFO  [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:handle(235)) - Size of event-queue is 1000
>
>
>What do these mean?
>
>
>> 
>> 
>> Regards,
>> + Naga
>> 
>> 
>> From: Dmitry Sivachenko [trtrmitya@gmail.com]
>> Sent: Wednesday, September 23, 2015 03:57
>> To: user@hadoop.apache.org
>> Subject: node remains unused after reboot
>> 
>> Hello!
>> 
>> I am using hadoop-2.7.1. I have a large map job running (about 3000 total cores available on the cluster, 35000 total tasks).
>> In the middle of this process one server reboots.
>> 
>> After the reboot, the nodemanager starts successfully and registers with the resource manager:
>> 2015-09-23 01:06:24,656 INFO  [main] nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:registerWithRM(311)) - Notifying ContainerManager to unblock new container-requests
>> 
>> In the YARN web interface I see this host as active, but VCores used remains zero (see screenshot).
>> But the map job mentioned is still running and has about 12000 pending tasks.
>> 
>> Why does this host not receive tasks to run?
>> 
>> PS: I recently upgraded from 2.4.1, and I did not notice this problem with 2.4.1: new tasks were spawned immediately after a reboot.
>> 
>> Thanks!
>> 
>> 
>> 
>> 
>> <Screen Shot 2015-09-23 at 1.22.10.png>
>

