Date: Wed, 13 May 2015 16:18:00 +0000 (UTC)
From: "Sangjin Lee (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-3634) TestMRTimelineEventHandling and TestApplication are broken

    [ https://issues.apache.org/jira/browse/YARN-3634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14542177#comment-14542177 ]

Sangjin Lee commented on YARN-3634:
-----------------------------------

Thanks for the feedback [~djp]!

Actually serviceStart() no longer calls getNMCollectorService() to initialize it. The issue is that it depends on the order of these services starting up (between NodeTimelineCollectorManager and NMCollectorService), and it turns out currently NodeTimelineCollectorManager starts before NMCollectorService.
The initialization of the NMCollectorService RPC client is now delayed until the first use (that's why direct references to nmCollectorService are replaced by getNMCollectorService() calls). That is also why synchronization is needed: it prevents multiple threads from competing to initialize the RPC client, which would be wasteful and potentially incorrect. Hope that makes it clear.

> TestMRTimelineEventHandling and TestApplication are broken
> ----------------------------------------------------------
>
>                 Key: YARN-3634
>                 URL: https://issues.apache.org/jira/browse/YARN-3634
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>    Affects Versions: YARN-2928
>            Reporter: Sangjin Lee
>            Assignee: Sangjin Lee
>         Attachments: YARN-3634-YARN-2928.001.patch, YARN-3634-YARN-2928.002.patch, YARN-3634-YARN-2928.003.patch
>
>
> TestMRTimelineEventHandling is broken. Relevant error message:
> {noformat}
> 2015-05-12 06:28:56,415 INFO [AsyncDispatcher event handler] ipc.Client (Client.java:handleConnectionFailure(882)) - Retrying connect to server: asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 2015-05-12 06:28:57,416 INFO [AsyncDispatcher event handler] ipc.Client (Client.java:handleConnectionFailure(882)) - Retrying connect to server: asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 2015-05-12 06:28:58,416 INFO [AsyncDispatcher event handler] ipc.Client (Client.java:handleConnectionFailure(882)) - Retrying connect to server: asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 2015-05-12 06:28:59,417 INFO [AsyncDispatcher event handler] ipc.Client (Client.java:handleConnectionFailure(882)) - Retrying connect to server: asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 2015-05-12 06:29:00,418 INFO [AsyncDispatcher event handler] ipc.Client (Client.java:handleConnectionFailure(882)) - Retrying connect to server: asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 2015-05-12 06:29:01,419 INFO [AsyncDispatcher event handler] ipc.Client (Client.java:handleConnectionFailure(882)) - Retrying connect to server: asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 2015-05-12 06:29:02,420 INFO [AsyncDispatcher event handler] ipc.Client (Client.java:handleConnectionFailure(882)) - Retrying connect to server: asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 2015-05-12 06:29:03,420 INFO [AsyncDispatcher event handler] ipc.Client (Client.java:handleConnectionFailure(882)) - Retrying connect to server: asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 2015-05-12 06:29:04,421 INFO [AsyncDispatcher event handler] ipc.Client (Client.java:handleConnectionFailure(882)) - Retrying connect to server: asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 2015-05-12 06:29:05,422 INFO [AsyncDispatcher event handler] ipc.Client (Client.java:handleConnectionFailure(882)) - Retrying connect to server: asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 2015-05-12 06:29:05,424 ERROR [AsyncDispatcher event handler] collector.NodeTimelineCollectorManager (NodeTimelineCollectorManager.java:postPut(121)) - Failed to communicate with NM Collector Service for application_1431412130291_0001
> 2015-05-12 06:29:05,425 WARN [AsyncDispatcher event handler] containermanager.AuxServices (AuxServices.java:logWarningWhenAuxServiceThrowExceptions(261)) - The auxService name is timeline_collector and it got an error at event: CONTAINER_INIT
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.ConnectException: Call From asf904.gq1.ygridcore.net/67.195.81.148 to asf904.gq1.ygridcore.net:0 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
> 	at org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollectorManager.putIfAbsent(TimelineCollectorManager.java:97)
> 	at org.apache.hadoop.yarn.server.timelineservice.collector.PerNodeTimelineCollectorsAuxService.addApplication(PerNodeTimelineCollectorsAuxService.java:99)
> 	at org.apache.hadoop.yarn.server.timelineservice.collector.PerNodeTimelineCollectorsAuxService.initializeContainer(PerNodeTimelineCollectorsAuxService.java:126)
> 	at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.handle(AuxServices.java:226)
> 	at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.handle(AuxServices.java:49)
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.ConnectException: Call From asf904.gq1.ygridcore.net/67.195.81.148 to asf904.gq1.ygridcore.net:0 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
> 	at org.apache.hadoop.yarn.server.timelineservice.collector.NodeTimelineCollectorManager.postPut(NodeTimelineCollectorManager.java:122)
> 	at org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollectorManager.putIfAbsent(TimelineCollectorManager.java:95)
> 	... 7 more
> Caused by: java.net.ConnectException: Call From asf904.gq1.ygridcore.net/67.195.81.148 to asf904.gq1.ygridcore.net:0 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> 	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> 	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> 	at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
> 	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1496)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1423)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> 	at com.sun.proxy.$Proxy108.getTimelineCollectorContext(Unknown Source)
> 	at org.apache.hadoop.yarn.server.api.impl.pb.client.CollectorNodemanagerProtocolPBClientImpl.getTimelineCollectorContext(CollectorNodemanagerProtocolPBClientImpl.java:99)
> 	at org.apache.hadoop.yarn.server.timelineservice.collector.NodeTimelineCollectorManager.updateTimelineCollectorContext(NodeTimelineCollectorManager.java:188)
> 	at org.apache.hadoop.yarn.server.timelineservice.collector.NodeTimelineCollectorManager.postPut(NodeTimelineCollectorManager.java:116)
> 	... 8 more
> Caused by: java.net.ConnectException: Connection refused
> 	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> 	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
> 	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
> 	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
> 	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
> 	at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:625)
> 	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:723)
> 	at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370)
> 	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1545)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1462)
> 	... 14 more
> {noformat}
> This surfaced when we switched to use port ":0" for the mini-YARN cluster for the node collector service.
> Also, TestApplication tests are broken because the mocked context does not have the configuration object which ApplicationImpl depends on.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
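
For readers of the archive, the lazy, synchronized initialization described in the comment can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the attached patches: the class `LazyRpcClientSketch`, the `CollectorClient` interface, and the `CREATED` counter are hypothetical stand-ins, and the real NMCollectorService proxy is created through Hadoop RPC rather than a lambda. Only the names `nmCollectorService` and `getNMCollectorService()` come from the comment.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class LazyRpcClientSketch {
    /** Hypothetical stand-in for the NMCollectorService RPC proxy. */
    interface CollectorClient {
        String getTimelineCollectorContext(String appId);
    }

    // Counts how many times the client is constructed (illustration only).
    static final AtomicInteger CREATED = new AtomicInteger();

    // Not set during service start; guarded by the monitor of "this".
    private CollectorClient nmCollectorService;

    // Creates the client on first use. Because the method is synchronized,
    // concurrent first callers cannot each build their own client.
    synchronized CollectorClient getNMCollectorService() {
        if (nmCollectorService == null) {
            CREATED.incrementAndGet();
            // Assumption: the real code would resolve the service address
            // here, after the service has started and bound its port.
            nmCollectorService = appId -> "context-for-" + appId;
        }
        return nmCollectorService;
    }

    public static void main(String[] args) throws InterruptedException {
        LazyRpcClientSketch manager = new LazyRpcClientSketch();
        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++) {
            final String appId = "app-" + i;
            threads[i] = new Thread(() ->
                manager.getNMCollectorService().getTimelineCollectorContext(appId));
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println("clients created: " + CREATED.get());
    }
}
```

Because callers go through the getter instead of reading the field directly, it no longer matters which of the two services starts first, as long as the first call happens after the server side is up.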