From: Varun Vasudev
Date: Wed, 23 Sep 2015 23:39:10 +0530
Subject: Re: node remains unused after reboot

Hi Dmitry,

Did you check the MR AM logs to see if the node was blacklisted for too many container failures?

-Varun

On 9/23/15, 12:26 PM, "Dmitry Sivachenko" wrote:

>
>> On 23 Sep 2015, at 7:02, Naganarasimha G R (Naga) wrote:
>>
>> Hi Dmitry,
>> This seems to be an interesting case; I would like some more clarification:
>> 1. How many NMs? Is it a heterogeneous cluster, or do all the nodes have the same resource capacity? With 3000 cores and the same config I would expect around 100 nodes, am I correct?
>
>I have 1 NN (and 1 SNN).
>To be precise, I have 113 32-core machines assigned to run jobs (113*32=3616 total VCores).
>
>> 2. How many applications are running and how many have finished (basically, are available in the RM)? By 35000, do you mean finished and running applications?
>
>There was 1 application running at that time (with 35000 map tasks).
>
>> 3. Do tasks get assigned after some time? Also, is it only this host that is not getting containers, or do other hosts also get no containers assigned?
>
>This machine was excluded from running tasks for that job. It got tasks assigned after almost 1.5 hours, when the first job (during which the machine failed) finished and the next job started. See the timestamps:
>
>2015-09-23 01:06:24,656 INFO [main] nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:registerWithRM(311)) - Notifying ContainerManager to unblock new container-requests
>2015-09-23 02:29:33,301 INFO [Socket Reader #1 for port 10007] ipc.Server (Server.java:saslProcess(1316)) - Auth successful for appattempt_1441808341485_1975_000001 (auth:SIMPLE)
>
>The previous job (during which that node rebooted) did not run any more tasks on this host.
>
>>
>> I suspect this issue might be similar to YARN-3990, hence the above questions. Further, you can check the RM logs and let us know whether you see logs similar to the ones below:
>> 2015-07-29 19:39:03,416 | INFO | AsyncDispatcher event handler | Size of event-queue is 14000 | AsyncDispatcher.java:235
>> 2015-07-29 19:39:03,417 | INFO | AsyncDispatcher event handler | Size of event-queue is 15000 | AsyncDispatcher.java:235
>
>There were 2 of these:
>2015-09-23 00:54:39,623 INFO [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:handle(235)) - Size of event-queue is 1000
>2015-09-23 01:06:24,623 INFO [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:handle(235)) - Size of event-queue is 1000
>
>What do these mean?
>
>>
>> Regards,
>> + Naga
>>
>> From: Dmitry Sivachenko [trtrmitya@gmail.com]
>> Sent: Wednesday, September 23, 2015 03:57
>> To: user@hadoop.apache.org
>> Subject: node remains unused after reboot
>>
>> Hello!
>>
>> I am using hadoop-2.7.1. I have a large map job running (total cores available on the cluster about 3000, total tasks 35000).
>> In the middle of this process one server reboots.
>>
>> After reboot, nodemanager starts successfully and registers with the resource manager:
>> 2015-09-23 01:06:24,656 INFO [main] nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:registerWithRM(311)) - Notifying ContainerManager to unblock new container-requests
>>
>> In the YARN web interface I see this host as active, but VCores used remains zero (see screenshot).
>> But the map job mentioned is still running and has about 12000 pending tasks.
>>
>> Why does this host not receive tasks to run?
>>
>> PS: I recently upgraded from 2.4.1 and I did not notice such a problem with 2.4.1: new tasks were spawning immediately after reboot.
>>
>> Thanks!
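
As a rough check of the "almost 1.5 hours" figure in the thread, the gap between the two NodeManager log lines quoted above (registration with the RM, then the first successful auth for the next job's app attempt) can be computed from their timestamps. This is only a minimal sketch: the helper names `parse_log_ts` and `log_gap` are hypothetical and not part of Hadoop, and it assumes every line starts with the `YYYY-MM-DD HH:MM:SS,mmm` prefix seen in the quoted logs.

```python
from datetime import datetime, timedelta

# Timestamp layout assumed from the quoted log lines, e.g. "2015-09-23 01:06:24,656"
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
LOG_TS_LENGTH = 23  # length of the timestamp prefix in characters


def parse_log_ts(line: str) -> datetime:
    """Parse the leading timestamp of a log line (hypothetical helper)."""
    return datetime.strptime(line[:LOG_TS_LENGTH], LOG_TS_FORMAT)


def log_gap(first_line: str, second_line: str) -> timedelta:
    """Return the time elapsed between two log lines (hypothetical helper)."""
    return parse_log_ts(second_line) - parse_log_ts(first_line)


# The two lines from the thread; the message text is shortened here because
# only the timestamp prefix is used.
registered = "2015-09-23 01:06:24,656 INFO [main] nodemanager.NodeStatusUpdaterImpl"
first_auth = "2015-09-23 02:29:33,301 INFO [Socket Reader #1 for port 10007] ipc.Server"

gap = log_gap(registered, first_auth)
print(gap)  # 1:23:08.645000
```

That is about 1 hour 23 minutes during which the node was registered but received no containers, which matches the "almost 1.5 hours" estimate.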