Date: Thu, 24 Sep 2015 00:38:19 +0530
Subject: Re: node remains unused after reboot
From: Naganarasimha Garla
To: user@hadoop.apache.org

Sorry for the late reply; I thought of providing you some search strings for blacklisting, hence the slight delay.

As Varun mentioned, it looks more like an app-blacklisting case. mapreduce.job.maxtaskfailures.per.tracker is 3 by default, so in the scenario you describe the likely explanation is that the node is getting blacklisted by the job. You can search for INFO logs containing the string "*Blacklisted host <host>*" from the RMContainerRequestor class.
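The search suggested above can be done with a plain grep over the MR ApplicationMaster log. The sketch below is illustrative: the sample log line's layout and the /tmp path are assumptions for the demo; only the "Blacklisted host" marker string (from RMContainerRequestor) comes from this thread.

```shell
# Write a sample AM log line to a scratch file (line layout is a guess;
# the "Blacklisted host" marker is the string to search for), then count
# how many blacklisting events appear in the log.
printf '2015-09-23 01:10:02,114 INFO [eventHandlingThread] RMContainerRequestor: Blacklisted host host42.example.com\n' > /tmp/sample-am-syslog
grep -c "Blacklisted host" /tmp/sample-am-syslog
```

On a real cluster you would point the grep at the AM container's syslog under the NodeManager's userlogs directory for the application. If the count is nonzero, raising mapreduce.job.maxtaskfailures.per.tracker for the job is one way to keep a node usable after transient failures.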
*What do these mean?* As per the defect in *YARN-3990*, if many events are clogged in the queue (seen in the logs as *Size of event-queue is 14000*), there is a possibility that events get delayed and hence assignment is delayed too; but as per the descriptions you shared, it does not seem to be this case. But how many finished applications were there? More nodes and more apps (finished/running) can cause this.

+ Naga

On Wed, Sep 23, 2015 at 11:39 PM, Varun Vasudev <vvasudev@apache.org> wrote:
> Hi Dmitry,
>
> Did you check the MR AM logs to see if the node was blacklisted for too
> many container failures?
>
> -Varun
>
> On 9/23/15, 12:26 PM, "Dmitry Sivachenko" <trtrmitya@gmail.com> wrote:
>
> >> On 23 Sep 2015, at 7:02, Naganarasimha G R (Naga)
> >> <garlanaganarasimha@huawei.com> wrote:
> >>
> >> Hi Dmitry,
> >> Seems to be an interesting case; would like some more clarifications in
> >> this regard:
> >> 1. How many NMs? Is it a heterogeneous cluster, or do all the nodes have
> >> the same resource capacity? By 3000 cores, if same config, then I am
> >> expecting around 100 nodes; am I correct?
> >
> > I have 1 NN (and 1 SNN).
> > To be precise, I have 113 32-core machines assigned to run jobs
> > (113*32 = 3616 total VCores).
> >
> >> 2. How many applications are running and how many have got finished
> >> (basically available in RM)? By 35000, do you mean finished and running
> >> applications?
> >
> > There was 1 application running at that time (with 35000 map tasks).
> >
> >> 3. Whether, after some time, tasks are getting assigned? Also, is it
> >> only this host not getting assigned, or does no other host get any
> >> containers assigned either?
> >
> > This machine was excluded from running tasks for that job. It got tasks
> > assigned after almost 1.5 hours, when the first job (during which the
> > machine failed) was finished and the next job was started; see timestamps:
> >
> > 2015-09-23 01:06:24,656 INFO  [main] nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:registerWithRM(311)) - Notifying ContainerManager to unblock new container-requests
> > 2015-09-23 02:29:33,301 INFO  [Socket Reader #1 for port 10007] ipc.Server (Server.java:saslProcess(1316)) - Auth successful for appattempt_1441808341485_1975_000001 (auth:SIMPLE)
> >
> > The previous job (during which that node rebooted) did not run more tasks
> > on this host.
> >
> >> I suspect this issue might be similar to YARN-3990, hence the above
> >> questions. Further, you can check the RM logs and inform whether you see
> >> some logs similar to the below:
> >> 2015-07-29 19:39:03,416 | INFO  | AsyncDispatcher event handler | Size of event-queue is 14000 | AsyncDispatcher.java:235
> >> 2015-07-29 19:39:03,417 | INFO  | AsyncDispatcher event handler | Size of event-queue is 15000 | AsyncDispatcher.java:235
> >
> > There were 2 of these:
> > 2015-09-23 00:54:39,623 INFO  [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:handle(235)) - Size of event-queue is 1000
> > 2015-09-23 01:06:24,623 INFO  [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:handle(235)) - Size of event-queue is 1000
> >
> > What do these mean?
> >
> >> Regards,
> >> + Naga
> >>
> >> From: Dmitry Sivachenko [trtrmitya@gmail.com]
> >> Sent: Wednesday, September 23, 2015 03:57
> >> To: user@hadoop.apache.org
> >> Subject: node remains unused after reboot
> >>
> >> Hello!
> >>
> >> I am using hadoop-2.7.1. I have a large map job running (total cores
> >> available on the cluster: about 3000; total tasks: 35000).
> >> In the middle of this process one server reboots.
> >>
> >> After reboot, the nodemanager starts successfully and registers with the
> >> resource manager:
> >> 2015-09-23 01:06:24,656 INFO  [main] nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:registerWithRM(311)) - Notifying ContainerManager to unblock new container-requests
> >>
> >> In the YARN web interface I see this host as active, but VCores used
> >> remains zero (see screenshot).
> >> But the map job mentioned is still running and has about 12000 pending
> >> tasks.
> >>
> >> Why does this host not receive tasks to run?
> >>
> >> PS: I recently upgraded from 2.4.1, and I did not notice such a problem
> >> with 2.4.1: new tasks were spawning immediately after reboot.
> >>
> >> Thanks!