hadoop-yarn-issues mailing list archives

From "Robert Kanter (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (YARN-1490) RM should optionally not kill all containers when an ApplicationMaster exits
Date Sat, 08 Feb 2014 01:01:19 GMT

     [ https://issues.apache.org/jira/browse/YARN-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Kanter updated YARN-1490:
--------------------------------

    Attachment: org.apache.oozie.service.TestRecoveryService_thread-dump.txt

As reported in the "Re-swizzle 2.3" email thread on the mailing lists, we saw some strange
behavior in the Oozie unit tests after YARN-1490. Basically, we use a single MiniMRCluster
and MiniDFSCluster across all unit tests in a module. With YARN-1490 we saw that, regardless
of test order, the last few tests would time out waiting for an MR job to finish; on slower
machines, the entire test suite would time out. Through some digging, I found that we were
getting a ton of "Connection refused" exceptions from the LeaseRenewer talking to the NN and
a few from the AM talking to the RM. So it sounds like there's something that happens over
time...
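
For reference, this is roughly the shared-cluster pattern our tests use (the class and method
names below are illustrative, not Oozie's actual test harness):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.apache.hadoop.mapred.MiniMRCluster;

// Illustrative sketch: one MiniDFSCluster + MiniMRCluster shared by every
// test in the module, torn down only once at the very end of the suite.
public class SharedClusters {
    private static MiniDFSCluster dfsCluster;
    private static MiniMRCluster mrCluster;

    public static synchronized void setUpClusters() throws Exception {
        if (dfsCluster == null) {
            Configuration conf = new Configuration();
            dfsCluster = new MiniDFSCluster.Builder(conf).numDataNodes(2).build();
            mrCluster = new MiniMRCluster(2, dfsCluster.getFileSystem().getUri().toString(), 1);
        }
    }

    public static synchronized void tearDownClusters() {
        if (mrCluster != null) {
            mrCluster.shutdown();
            mrCluster = null;
        }
        if (dfsCluster != null) {
            dfsCluster.shutdown();
            dfsCluster = null;
        }
    }
}
{code}

Because the clusters live for the whole module, anything that slowly breaks the NN or RM RPC
endpoints only shows up in the last few tests, which matches what we're seeing.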

I've attached a thread dump (org.apache.oozie.service.TestRecoveryService_thread-dump.txt)
taken during the test where we saw the timeout, though it's possible that the issue manifests
itself earlier but isn't noticeable until then.

And here is one of the exceptions that we see in the MiniMRCluster's syslog for the container
used during that test; it repeats many, many times:
{noformat}
2014-02-07 14:42:22,998 WARN [LeaseRenewer:test@localhost:56186] org.apache.hadoop.hdfs.LeaseRenewer:
Failed to renew lease for [DFSClient_NONMAPREDUCE_-1380838220_1] for 2419 seconds.  Will retry
shortly ...
java.net.ConnectException: Call From rkanter-mbp.local/172.16.1.64 to localhost:56186 failed
on connection exception: java.net.ConnectException: Connection refused; For more details see:
 http://wiki.apache.org/hadoop/ConnectionRefused
        at sun.reflect.GeneratedConstructorAccessor17.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
        at org.apache.hadoop.ipc.Client.call(Client.java:1410)
        at org.apache.hadoop.ipc.Client.call(Client.java:1359)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
        at com.sun.proxy.$Proxy9.renewLease(Unknown Source)
        at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
        at com.sun.proxy.$Proxy9.renewLease(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.renewLease(ClientNamenodeProtocolTranslatorPB.java:519)
        at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:773)
        at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:417)
        at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:442)
        at org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71)
        at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:298)
        at java.lang.Thread.run(Thread.java:744)
Caused by: java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:735)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
        at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:601)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:696)
        at org.apache.hadoop.ipc.Client$Connection.access$2700(Client.java:367)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1458)
        at org.apache.hadoop.ipc.Client.call(Client.java:1377)
        ... 16 more
{noformat}
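
FWIW, "Connection refused" here means nothing is listening on the NN's RPC port anymore
(see the wiki link in the trace), as opposed to the NN merely being slow. A trivial probe
like the following, run from the test JVM, can confirm that; the port number is the one from
this particular run and changes between runs:

{code:java}
import java.net.InetSocketAddress;
import java.net.Socket;

// Illustrative probe: prints a message if something is still listening on
// the NN RPC port, and throws ConnectException ("Connection refused") if not.
public class NNPortProbe {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress("localhost", 56186), 1000);
            System.out.println("NN port is still listening");
        }
    }
}
{code}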

I'm going to continue looking into why YARN-1490 is causing this behavior, but I thought I'd
post this info here in case anyone has any ideas.
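
For anyone not following the patch closely: the feature itself is opt-in. An application asks
the RM to keep its containers across AM attempts through ApplicationSubmissionContext; here is
a minimal sketch (the application name is made up):

{code:java}
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.util.Records;

public class KeepContainersExample {
    // Sketch of the opt-in added by YARN-1490: keep running containers
    // alive across ApplicationMaster restarts instead of killing them.
    static ApplicationSubmissionContext newContext() {
        ApplicationSubmissionContext appContext =
            Records.newRecord(ApplicationSubmissionContext.class);
        appContext.setApplicationName("my-app"); // made-up name
        appContext.setKeepContainersAcrossApplicationAttempts(true); // the new flag
        // ... the AM ContainerLaunchContext, resource ask, etc. still need to be
        // set before submitting via YarnClient.submitApplication(appContext).
        return appContext;
    }
}
{code}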

> RM should optionally not kill all containers when an ApplicationMaster exits
> ----------------------------------------------------------------------------
>
>                 Key: YARN-1490
>                 URL: https://issues.apache.org/jira/browse/YARN-1490
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Jian He
>             Fix For: 2.4.0
>
>         Attachments: YARN-1490.1.patch, YARN-1490.10.patch, YARN-1490.11.patch, YARN-1490.11.patch,
>                      YARN-1490.12.patch, YARN-1490.2.patch, YARN-1490.3.patch, YARN-1490.4.patch, YARN-1490.5.patch,
>                      YARN-1490.6.patch, YARN-1490.7.patch, YARN-1490.8.patch, YARN-1490.9.patch, org.apache.oozie.service.TestRecoveryService_thread-dump.txt
>
>
> This is needed to enable work-preserving AM restart. Some apps can choose to reconnect
> with old running containers, some may not want to. This should be an option.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
