Date: Fri, 7 Jul 2017 22:00:00 +0000 (UTC)
From: "Jason Lowe (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-6409) RM does not blacklist node for AM launch failures

    [ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16078762#comment-16078762 ]

Jason Lowe commented on YARN-6409:
----------------------------------

Sorry for the long delay in responding -- I was out of the office quite a bit lately. The approach seems reasonable to me, assuming we're getting some level of retries during the AM launch from the container manager proxy in AMLauncher. If we can't get an AM to launch on that node even after some retries, then it is very likely that a subsequent attempt will also fail. IMHO that's an appropriate time to blacklist.
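To make the "some level of retries" assumption concrete, here is a minimal sketch of a ContainerManagementProtocol proxy wrapped with one of the Hadoop retry policies. This is illustrative only -- the class name, retry count, and sleep interval are made up and this is not the actual AMLauncher/NMProxy code.

{code}
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;
import org.apache.hadoop.io.retry.RetryProxy;
import org.apache.hadoop.yarn.api.ContainerManagementProtocol;

// Hypothetical helper: wraps a raw NM proxy so that a transient connect or
// timeout failure is retried a few times before the AM launch is declared
// failed (and thus before any blacklisting decision is made).
public final class RetryingContainerManager {

  public static ContainerManagementProtocol wrap(ContainerManagementProtocol rawProxy) {
    // Illustrative policy: up to 3 retries, 10 seconds apart.
    RetryPolicy policy = RetryPolicies.retryUpToMaximumCountWithFixedSleep(
        3, 10, TimeUnit.SECONDS);
    return (ContainerManagementProtocol) RetryProxy.create(
        ContainerManagementProtocol.class, rawProxy, policy);
  }
}
{code}

The RetryInvocationHandler frames in the stack trace below suggest a retry proxy of this kind is already in the startContainers path; the open question is whether its policy actually retries this particular timeout before the launch is failed.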
If the container manager proxy is _not_ doing the retries in this case, then that would be the first place to fix -- we should be trying a bit harder to get the current attempt launched before jumping to blacklist conclusions. I'm confused about how the NM is capable of regularly heartbeating (and thus getting scheduled for AM launches) but also regularly failing to respond to launch requests. If this is a common occurrence then it needs to be root-caused; this proposed change is not really a fix for that, just a workaround in case it occurs. Without a fix it will lead to prolonged AM and task launch times, since I'm assuming AMs will see similar difficulties trying to launch tasks on a node if the RM cannot launch an AM on it.

> RM does not blacklist node for AM launch failures
> -------------------------------------------------
>
>                 Key: YARN-6409
>                 URL: https://issues.apache.org/jira/browse/YARN-6409
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.0.0-alpha2
>            Reporter: Haibo Chen
>            Assignee: Haibo Chen
>         Attachments: YARN-6409.00.patch, YARN-6409.01.patch, YARN-6409.02.patch, YARN-6409.03.patch
>
>
> Currently, node blacklisting upon AM failures only handles failures that happen after the AM container is launched (see RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()). However, an AM launch can also fail if the NM where the AM container is allocated goes unresponsive. Because this case is not handled, the scheduler may continue to allocate AM containers on that same NM for the following app attempts (a rough sketch of the missing check is included after the stack trace below).
> {code}
> Application application_1478721503753_0870 failed 2 times due to Error launching appattempt_1478721503753_0870_000002. Got exception: java.io.IOException: Failed on local exception: java.io.IOException: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read.
> ch : java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]; Host Details : local host is: "*.me.com/17.111.179.113"; destination host is: "*.me.com":8041;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> at org.apache.hadoop.ipc.Client.call(Client.java:1475)
> at org.apache.hadoop.ipc.Client.call(Client.java:1408)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
> at com.sun.proxy.$Proxy86.startContainers(Unknown Source)
> at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96)
> at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
> at com.sun.proxy.$Proxy87.startContainers(Unknown Source)
> at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120)
> at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]
> at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
> at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650)
> at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:738)
> at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1524)
> at org.apache.hadoop.ipc.Client.call(Client.java:1447)
> ... 15 more
> Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read.
> ch : java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]
> at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
> at java.io.FilterInputStream.read(FilterInputStream.java:133)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
> at java.io.DataInputStream.readInt(DataInputStream.java:387)
> at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:367)
> at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:560)
> at org.apache.hadoop.ipc.Client$Connection.access$1900(Client.java:375)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:730)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:726)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
> at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:725)
> ... 18 more
> . Failing the application.
> {code}
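To illustrate the gap described above (and referenced before the stack trace), here is a minimal, self-contained sketch of the kind of check being proposed: count a launch failure toward AM node blacklisting alongside the post-launch failures that are already counted. The class and enum names are hypothetical and do not mirror the actual RMAppAttemptImpl code.

{code}
// Hypothetical sketch only -- not the real RMAppAttemptImpl logic.
public final class AmFailureBlacklistPolicy {

  /** The failure cases this sketch distinguishes. */
  public enum FailureKind { LAUNCH_FAILED, FAILED_AFTER_LAUNCH, PREEMPTED, DISKS_FAILED }

  /**
   * Decide whether an AM attempt failure should add its node to the AM
   * blacklist. Today only post-launch failures are counted; the proposal in
   * this issue is to also count a failed launch (e.g. an unresponsive NM).
   */
  public static boolean shouldCountTowardsNodeBlacklisting(FailureKind kind) {
    switch (kind) {
      case FAILED_AFTER_LAUNCH:
      case LAUNCH_FAILED:       // proposed addition for YARN-6409
        return true;
      default:
        return false;           // e.g. preemption should not penalize the node
    }
  }
}
{code}

Per the comment thread above, the launch-failure case should presumably only count once the launcher-side retries are exhausted, so a single slow RPC does not blacklist an otherwise healthy node.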