Subject: Re: ResourceManager shutting down
From: Hitesh Shah
Date: Fri, 14 Mar 2014 08:05:33 -0700
To: user@hadoop.apache.org
Message-Id: <90EE5139-D26F-4C79-BFE3-DCD1C020FC4B@apache.org>
In-Reply-To: <869970D71E26D7498BDAC4E1CA92226B86E7ADB9@MBX021-E3-NJ-2.exch021.domain.local>

Hi John,

Would you mind filing a JIRA with more details? The RM going down just because a host was not resolvable or DNS timed out is something that should be addressed.

thanks
-- Hitesh

On Mar 13, 2014, at 2:29 PM, John Lilley wrote:

> Never mind… we figured out its DNS entry was going missing.
> john
>
> From: John Lilley [mailto:john.lilley@redpoint.net]
> Sent: Thursday, March 13, 2014 2:52 PM
> To: user@hadoop.apache.org
> Subject: ResourceManager shutting down
>
> We have this erratic behavior where every so often the RM will shut down with an UnknownHostException. The odd thing is, the host it complains about has been in use for days at that point without problem. Any ideas?
> Thanks,
> John
>
> 2014-03-13 14:38:14,746 INFO rmapp.RMAppImpl (RMAppImpl.java:handle(578)) - application_1394204725813_0220 State change from ACCEPTED to RUNNING
> 2014-03-13 14:38:15,794 FATAL resourcemanager.ResourceManager (ResourceManager.java:run(449)) - Error in handling event type NODE_UPDATE to the scheduler
> java.lang.IllegalArgumentException: java.net.UnknownHostException: skitzo.office.datalever.com
>     at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:418)
>     at org.apache.hadoop.yarn.server.utils.BuilderUtils.newContainerToken(BuilderUtils.java:247)
>     at org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager.createContainerToken(RMContainerTokenSecretManager.java:195)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.createContainerToken(LeafQueue.java:1297)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainer(LeafQueue.java:1345)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignOffSwitchContainers(LeafQueue.java:1211)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainersOnNode(LeafQueue.java:1170)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:871)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:645)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:559)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:690)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:734)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:86)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
>     at java.lang.Thread.run(Thread.java:662)
> Caused by: java.net.UnknownHostException: skitzo.office.datalever.com
>     ... 15 more
> 2014-03-13 14:38:15,794 INFO resourcemanager.ResourceManager (ResourceManager.java:run(453)) - Exiting, bbye..
> 2014-03-13 14:38:15,911 INFO mortbay.log (Slf4jLog.java:info(67)) - Stopped SelectChannelConnector@metallica.office.datalever.com:8088
> 2014-03-13 14:38:16,013 ERROR delegation.AbstractDelegationTokenSecretManager (AbstractDelegationTokenSecretManager.java:run(557)) - InterruptedExcpetion recieved for ExpiredTokenRemover thread java.lang.InterruptedException: sleep interrupted
> 2014-03-13 14:38:16,013 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(200)) - Stopping ResourceManager metrics system...
> 2014-03-13 14:38:16,014 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(206)) - ResourceManager metrics system stopped.
> 2014-03-13 14:38:16,014 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:shutdown(572)) - ResourceManager metrics system shutdown complete.
> 2014-03-13 14:38:16,015 WARN amlauncher.ApplicationMasterLauncher (ApplicationMasterLauncher.java:run(98)) - org.apache.hadoop.yarn.server.resourcemanager.amlauncher.ApplicationMasterLauncher$LauncherThread interrupted. Returning.
> 2014-03-13 14:38:16,015 INFO ipc.Server (Server.java:stop(2442)) - Stopping server on 8141
> 2014-03-13 14:38:16,017 INFO ipc.Server (Server.java:stop(2442)) - Stopping server on 8050
> … and so on, it shuts down
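
For anyone chasing a similar intermittent lookup failure: the trace above shows an UnknownHostException surfacing from SecurityUtil.buildTokenService, wrapped in an IllegalArgumentException. Below is a minimal sketch of a standalone probe that repeatedly resolves a hostname from the RM machine and counts the failures. It is not part of Hadoop; the class name DnsProbe, the Resolver hook, and the attempt and pause values are all made up for illustration. It assumes that setting the JVM's networkaddress.cache.ttl and networkaddress.cache.negative.ttl security properties to 0 before the first lookup is enough to avoid JVM-level caching; OS-level caches may still hide short outages.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.security.Security;
import java.time.Instant;

public class DnsProbe {

    /** Hypothetical hook so the actual lookup can be swapped out (e.g. for a fake). */
    public interface Resolver {
        InetAddress resolve(String host) throws UnknownHostException;
    }

    /**
     * Tries to resolve the host the given number of times, pausing between
     * attempts, and returns how many attempts failed with UnknownHostException.
     */
    public static int countFailures(Resolver resolver, String host, int attempts, long pauseMillis) {
        int failures = 0;
        for (int i = 0; i < attempts; i++) {
            try {
                resolver.resolve(host);
            } catch (UnknownHostException e) {
                failures++;
                System.err.println(Instant.now() + " lookup failed for " + host);
            }
            if (i < attempts - 1) {
                try {
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException ie) {
                    // Restore the interrupt flag and stop probing early.
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // Assumption: disabling the JVM's positive and negative DNS caches
        // makes each attempt a fresh lookup. Must run before the first lookup.
        Security.setProperty("networkaddress.cache.ttl", "0");
        Security.setProperty("networkaddress.cache.negative.ttl", "0");

        String host = args.length > 0 ? args[0] : "localhost";
        int attempts = 10;      // illustrative value
        long pauseMillis = 1000; // illustrative value
        int failures = countFailures(InetAddress::getByName, host, attempts, pauseMillis);
        System.out.println(failures + " of " + attempts + " lookups failed for " + host);
    }
}
```

Run it with the hostname from the exception as the only argument (for example `java DnsProbe skitzo.office.datalever.com`) and leave it going long enough to span one of the outages. A non-zero count would point at name resolution rather than at the RM itself.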