hadoop-yarn-issues mailing list archives

From "Hitesh Shah (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-196) Nodemanager if started before starting Resource manager is getting shutdown.But if both RM and NM are started and then after if RM is going down,NM is retrying for the RM.
Date Fri, 08 Feb 2013 22:47:13 GMT

    [ https://issues.apache.org/jira/browse/YARN-196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13574908#comment-13574908 ]

Hitesh Shah commented on YARN-196:
----------------------------------

Xuan, some comments:
  - testNMShutdownForRegistrationFailure tests for an explicit command from the RM telling
the NM to shut down. 
  - failing within 10 seconds seems too quick. The RPC layer already retries internally for a
certain time period. At the NM layer, we should probably define a total wait time (say
15 minutes) and retry every 30 seconds or so within that period.
  - also someone should be able to set the time period to -1 to disable the upper bound and
retry forever if needed.
  - use the same conventions as used elsewhere when naming variables - rm_Retry_interval_ms does
not conform to the standards defined in the class.
  - "LOG.debug("Fail to connect to RM");" - change this to error level and log the exception
stack trace, unless it is already being caught and printed elsewhere. It would also help to log
how many retries were attempted before giving up.
  - in the start() function, there is an AvroRuntimeException being thrown - we should replace
that with YarnException or an appropriate runtime exception. 
  - isRMStarted var is not needed - a simple break in the loop if the registration is done
should suffice. 
  - please remove the space in "rm_Retry_Count --;"
  - the debug log message at the end of the loop should use WARN level. Also, please
re-phrase it for more clarity - something along the lines of "Retrying connecting to RM, current
no. of failed attempts ..."
  - the current patch seems to be catching all exceptions. This will cause a problem in the
case where the RM explicitly asks the NM to shut down - maybe it makes sense to move the retry
logic into the registerWithRM function?
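
To make the suggestions above concrete, here is a minimal sketch of what such a retry loop could look like. It is not the YARN-196 patch: the class name, the Registrar interface, RegistrationFailedException (standing in for YarnException) and the method signature are all hypothetical, and java.util.logging is used only to keep the example self-contained. It assumes that connection failures surface as IOException and that an explicit shutdown command from the RM surfaces as some other exception, which is therefore not retried.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical sketch of NM-side retry logic; not the actual YARN-196 patch. */
class RmRegistrationRetry {
  private static final Logger LOG =
      Logger.getLogger(RmRegistrationRetry.class.getName());

  /** Hypothetical stand-in for the RPC call made inside registerWithRM(). */
  interface Registrar {
    void register() throws IOException;
  }

  /** Hypothetical stand-in for YarnException: an unchecked, fatal failure. */
  static class RegistrationFailedException extends RuntimeException {
    RegistrationFailedException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  /**
   * @param maxWaitMs total time to keep retrying; -1 disables the upper bound
   * @param retryIntervalMs pause between attempts
   * @return number of failed attempts before registration succeeded
   */
  static int registerWithRetries(Registrar registrar, long maxWaitMs,
      long retryIntervalMs) {
    final long startNanos = System.nanoTime();
    int failedAttempts = 0;
    while (true) {
      try {
        registrar.register();
        return failedAttempts; // registered: leave the loop, no flag needed
      } catch (IOException e) {
        // Only connection-level failures are retried. Any other exception,
        // such as one signalling an RM shutdown command, propagates as-is.
        failedAttempts++;
        long elapsedMs = (System.nanoTime() - startNanos) / 1_000_000L;
        if (maxWaitMs != -1 && elapsedMs >= maxWaitMs) {
          String msg = "Failed to connect to RM after " + failedAttempts
              + " attempt(s)";
          LOG.log(Level.SEVERE, msg, e); // error level, with stack trace
          throw new RegistrationFailedException(msg, e);
        }
        LOG.log(Level.WARNING, "Retrying connection to RM, failed attempts "
            + "so far: " + failedAttempts);
        try {
          Thread.sleep(retryIntervalMs);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt(); // preserve interrupt status
          throw new RegistrationFailedException(
              "Interrupted while waiting to retry RM registration", ie);
        }
      }
    }
  }
}
```

With the numbers suggested above, a call would look like registerWithRetries(rpcCall, 15 * 60 * 1000L, 30 * 1000L), where rpcCall is whatever wraps the real registration RPC; passing -1 as the second argument retries forever.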

                
> Nodemanager if started before starting Resource manager is getting shutdown.But if both RM and NM are started and then after if RM is going down,NM is retrying for the RM.
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-196
>                 URL: https://issues.apache.org/jira/browse/YARN-196
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 3.0.0, 2.0.0-alpha
>            Reporter: Ramgopal N
>            Assignee: Xuan Gong
>         Attachments: MAPREDUCE-3676.patch, YARN-196.1.patch, YARN-196.2.patch
>
>
> If the NM is started before the RM, the NM shuts down with the following error:
> {code}
> ERROR org.apache.hadoop.yarn.service.CompositeService: Error starting services org.apache.hadoop.yarn.server.nodemanager.NodeManager
> org.apache.avro.AvroRuntimeException: java.lang.reflect.UndeclaredThrowableException
> 	at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.start(NodeStatusUpdaterImpl.java:149)
> 	at org.apache.hadoop.yarn.service.CompositeService.start(CompositeService.java:68)
> 	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.start(NodeManager.java:167)
> 	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:242)
> Caused by: java.lang.reflect.UndeclaredThrowableException
> 	at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:66)
> 	at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:182)
> 	at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.start(NodeStatusUpdaterImpl.java:145)
> 	... 3 more
> Caused by: com.google.protobuf.ServiceException: java.net.ConnectException: Call From HOST-10-18-52-230/10.18.52.230 to HOST-10-18-52-250:8025 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
> 	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:131)
> 	at $Proxy23.registerNodeManager(Unknown Source)
> 	at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:59)
> 	... 5 more
> Caused by: java.net.ConnectException: Call From HOST-10-18-52-230/10.18.52.230 to HOST-10-18-52-250:8025 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
> 	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:857)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1141)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1100)
> 	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:128)
> 	... 7 more
> Caused by: java.net.ConnectException: Connection refused
> 	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> 	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
> 	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
> 	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:659)
> 	at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:469)
> 	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:563)
> 	at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:211)
> 	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1247)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1117)
> 	... 9 more
> 2012-01-16 15:04:13,336 WARN org.apache.hadoop.yarn.event.AsyncDispatcher: AsyncDispatcher thread interrupted
> java.lang.InterruptedException
> 	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:1899)
> 	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1934)
> 	at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:76)
> 	at java.lang.Thread.run(Thread.java:619)
> 2012-01-16 15:04:13,337 INFO org.apache.hadoop.yarn.service.AbstractService: Service:Dispatcher is stopped.
> 2012-01-16 15:04:13,392 INFO org.mortbay.log: Stopped SelectChannelConnector@0.0.0.0:9999
> 2012-01-16 15:04:13,493 INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.webapp.WebServer is stopped.
> 2012-01-16 15:04:13,493 INFO org.apache.hadoop.ipc.Server: Stopping server on 24290
> 2012-01-16 15:04:13,494 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 24290
> 2012-01-16 15:04:13,495 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
> 2012-01-16 15:04:13,496 INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.containermanager.loghandler.NonAggregatingLogHandler is stopped.
> 2012-01-16 15:04:13,496 WARN org.apache.hadoop.yarn.event.AsyncDispatcher: AsyncDispatcher thread interrupted
> java.lang.InterruptedException
> 	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:1899)
> 	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1934)
> 	at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:76)
> 	at java.lang.Thread.run(Thread.java:619)
> {code}

