From: Rohith Sharma K S
To: "user@hadoop.apache.org"
Subject: RE: ResourceManager shutting down
Date: Fri, 14 Mar 2014 04:00:43 +0000

Hi Hitesh,

Yes, it is an issue. It is handled in https://issues.apache.org/jira/i#browse/YARN-713, which fixes the DNS issue. The fix is available in hadoop-2.4 (unreleased).

Thanks & Regards
Rohith Sharma K S

-----Original Message-----
From: Hitesh Shah [mailto:hitesh@apache.org]
Sent: 14 March 2014 09:03
To: user@hadoop.apache.org
Subject: Re: ResourceManager shutting down

Hi John

Would you mind filing a jira with more details? The RM going down just because a host was not resolvable or DNS timed out is something that should be addressed.

thanks
-- Hitesh

On Mar 13, 2014, at 2:29 PM, John Lilley wrote:

> Never mind... we figured out its DNS entry was going missing.
> john
>
> From: John Lilley [mailto:john.lilley@redpoint.net]
> Sent: Thursday, March 13, 2014 2:52 PM
> To: user@hadoop.apache.org
> Subject: ResourceManager shutting down
>
> We have this erratic behavior where every so often the RM will shut down with an UnknownHostException. The odd thing is, the hosts it complains about have been in use for days at that point without problem. Any ideas?
> Thanks,
> John
>
> 2014-03-13 14:38:14,746 INFO rmapp.RMAppImpl (RMAppImpl.java:handle(578)) - application_1394204725813_0220 State change from ACCEPTED to RUNNING
> 2014-03-13 14:38:15,794 FATAL resourcemanager.ResourceManager (ResourceManager.java:run(449)) - Error in handling event type NODE_UPDATE to the scheduler
> java.lang.IllegalArgumentException: java.net.UnknownHostException: skitzo.office.datalever.com
>         at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:418)
>         at org.apache.hadoop.yarn.server.utils.BuilderUtils.newContainerToken(BuilderUtils.java:247)
>         at org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager.createContainerToken(RMContainerTokenSecretManager.java:195)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.createContainerToken(LeafQueue.java:1297)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainer(LeafQueue.java:1345)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignOffSwitchContainers(LeafQueue.java:1211)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainersOnNode(LeafQueue.java:1170)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:871)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:645)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:559)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:690)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:734)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:86)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: java.net.UnknownHostException: skitzo.office.datalever.com
>         ... 15 more
> 2014-03-13 14:38:15,794 INFO resourcemanager.ResourceManager (ResourceManager.java:run(453)) - Exiting, bbye..
> 2014-03-13 14:38:15,911 INFO mortbay.log (Slf4jLog.java:info(67)) - Stopped SelectChannelConnector@metallica.office.datalever.com:8088
> 2014-03-13 14:38:16,013 ERROR delegation.AbstractDelegationTokenSecretManager (AbstractDelegationTokenSecretManager.java:run(557)) - InterruptedExcpetion recieved for ExpiredTokenRemover thread java.lang.InterruptedException: sleep interrupted
> 2014-03-13 14:38:16,013 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(200)) - Stopping ResourceManager metrics system...
> 2014-03-13 14:38:16,014 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(206)) - ResourceManager metrics system stopped.
> 2014-03-13 14:38:16,014 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:shutdown(572)) - ResourceManager metrics system shutdown complete.
> 2014-03-13 14:38:16,015 WARN amlauncher.ApplicationMasterLauncher (ApplicationMasterLauncher.java:run(98)) - org.apache.hadoop.yarn.server.resourcemanager.amlauncher.ApplicationMasterLauncher$LauncherThread interrupted. Returning.
> 2014-03-13 14:38:16,015 INFO ipc.Server (Server.java:stop(2442)) - Stopping server on 8141
> 2014-03-13 14:38:16,017 INFO ipc.Server (Server.java:stop(2442)) - Stopping server on 8050
> ... and so on, it shuts down
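
For anyone reading this thread later, the failure path in the stack trace can be sketched in plain Java without any Hadoop dependency. This is only an illustration and not the YARN-713 patch. The class TokenServiceSketch, both method names, and port 8041 are hypothetical. The unresolved-address check is an assumption inferred from the exception shape in the log (an IllegalArgumentException wrapping an UnknownHostException), not a copy of SecurityUtil.

```java
import java.net.InetSocketAddress;
import java.net.UnknownHostException;

// Hypothetical stand-in; not Hadoop code.
class TokenServiceSketch {

    // Assumed behavior: an address whose hostname could not be resolved
    // is rejected with an unchecked exception wrapping UnknownHostException.
    static String buildTokenService(InetSocketAddress addr) {
        if (addr.isUnresolved()) {
            throw new IllegalArgumentException(
                    new UnknownHostException(addr.getHostName()));
        }
        return addr.getAddress().getHostAddress() + ":" + addr.getPort();
    }

    // One possible defensive caller: treat a DNS failure as "skip this
    // node for now" (null) instead of letting the exception escape the
    // event-handling thread. Other IllegalArgumentExceptions still propagate.
    static String tryBuildTokenService(InetSocketAddress addr) {
        try {
            return buildTokenService(addr);
        } catch (IllegalArgumentException e) {
            if (e.getCause() instanceof UnknownHostException) {
                return null;
            }
            throw e;
        }
    }

    public static void main(String[] args) {
        // createUnresolved never performs a DNS lookup, so this simulates
        // the missing DNS entry deterministically.
        InetSocketAddress missing = InetSocketAddress.createUnresolved(
                "skitzo.office.datalever.com", 8041);
        System.out.println("token service: " + tryBuildTokenService(missing));
    }
}
```

The point of the sketch is only that an unchecked exception thrown deep inside a NODE_UPDATE handler will reach the dispatcher thread unless some caller catches it, which matches the "Exiting, bbye.." line in the log.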