Date: Thu, 2 Mar 2017 15:28:45 +0000 (UTC)
From: "Aleksandr Balitsky (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Subject: [jira] [Comment Edited] (MAPREDUCE-6834) MR application fails with "No NMToken sent" exception after MRAppMaster recovery

[ https://issues.apache.org/jira/browse/MAPREDUCE-6834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892400#comment-15892400 ]

Aleksandr Balitsky edited comment on MAPREDUCE-6834 at 3/2/17 3:27 PM:
-----------------------------------------------------------------------

Hi [~haibochen], [~jlowe]

Sorry for the late reply.

{quote}
Is this a scenario where somehow the MRAppMaster is asking to preserve containers across app attempts? I ask because ApplicationMasterService normally does not call setNMTokensFromPreviousAttempts on RegisterApplicationMasterResponse unless getKeepContainersAcrossApplicationAttempts on the application submission context is true. Last I checked, the MapReduce client (YARNRunner) wasn't specifying that when the application is submitted to YARN.
{quote}

You are right. I did not consider that MR doesn't support AM work-preserving restart, and I now see that my first patch isn't a good solution for this problem. Thanks for the review!
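To make the quoted point concrete, here is a small self-contained sketch (hypothetical class names standing in for ApplicationSubmissionContext, RegisterApplicationMasterResponse, and ApplicationMasterService; NOT Hadoop's actual classes) of the behaviour described: the RM hands NMTokens from previous attempts back to a re-registering AM only when the submission context opted into keeping containers across attempts, which MR's YARNRunner does not do:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for ApplicationSubmissionContext.
class SubmissionContext {
    boolean keepContainersAcrossApplicationAttempts = false; // MR leaves this false
}

// Hypothetical stand-in for RegisterApplicationMasterResponse.
class RegisterResponse {
    final List<String> nmTokensFromPreviousAttempts = new ArrayList<>();
}

// Hypothetical stand-in for the RM-side registration logic.
class ResourceManagerModel {
    private final List<String> previousAttemptTokens;

    ResourceManagerModel(List<String> previousAttemptTokens) {
        this.previousAttemptTokens = previousAttemptTokens;
    }

    // Mirrors the quoted description: previous-attempt tokens are returned
    // only for work-preserving restarts.
    RegisterResponse registerApplicationMaster(SubmissionContext ctx) {
        RegisterResponse resp = new RegisterResponse();
        if (ctx.keepContainersAcrossApplicationAttempts) {
            resp.nmTokensFromPreviousAttempts.addAll(previousAttemptTokens);
        }
        return resp;
    }
}

class RegistrationSketch {
    public static void main(String[] args) {
        ResourceManagerModel rm = new ResourceManagerModel(List.of("nmtoken-node1:43037"));

        // MR's case: flag left false, so a recovered AM gets no previous-attempt tokens.
        System.out.println(rm.registerApplicationMaster(new SubmissionContext())
                .nmTokensFromPreviousAttempts.size()); // prints 0

        // A work-preserving application would get them back.
        SubmissionContext preserving = new SubmissionContext();
        preserving.keepContainersAcrossApplicationAttempts = true;
        System.out.println(rm.registerApplicationMaster(preserving)
                .nmTokensFromPreviousAttempts.size()); // prints 1
    }
}
```

This is only a model of the condition under discussion, not the real ApplicationMasterService code path.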
{quote}
Aleksandr Balitsky, which scheduler were you running?
{quote}

I'm running the Fair Scheduler. I don't think this issue depends on the scheduler, but I will check it with other schedulers as well.

{quote}
We have not made changes to preserve containers in MR. Chasing the code in more detail, I came to a similar conclusion as https://issues.apache.org/jira/browse/YARN-3112?focusedCommentId=14299003&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14299003 MR relies on the YARN RM to get the NMTokens needed to launch containers with NMs. Given the code today, it is possible that a null NMToken is sent to MR, which contradicts the javadoc in SchedulerApplicationAttempt.java here
{quote}

I totally agree that we have not made changes to preserve containers in MR. But the solution you mentioned contradicts the YARN design:

{quote}
As for network optimization, NMTokens are not sent to the ApplicationMasters for each and every allocated container, but only for the first time or if NMTokens have to be invalidated due to the rollover of the underlying master key
{quote}

It is true that a null NMToken can be sent to MR: NMTokens are sent only after first creation, which is by design, and the AM then saves them into its NMTokenCache. It is not necessary to pass NMTokens during every allocate interaction. So clearing the NMTokenSecretManager cache on each allocation is not the best solution: it disables the caching behaviour, and new NMTokens would be generated (instead of reusing the cached instance) in every allocate response. IMHO we shouldn't do this, because it doesn't fix the root cause; it is a workaround.

> MR application fails with "No NMToken sent" exception after MRAppMaster recovery
> --------------------------------------------------------------------------------
>
>                  Key: MAPREDUCE-6834
>                  URL: https://issues.apache.org/jira/browse/MAPREDUCE-6834
>              Project: Hadoop Map/Reduce
>           Issue Type: Bug
>           Components: resourcemanager, yarn
>     Affects Versions: 2.7.0
>          Environment: Centos 7
>             Reporter: Aleksandr Balitsky
>             Assignee: Aleksandr Balitsky
>             Priority: Critical
>          Attachments: YARN-6019.001.patch
>
>
> *Steps to reproduce:*
> 1) Submit an MR application (for example, the Pi app with 50 containers)
> 2) Find the MRAppMaster process id for the application
> 3) Kill the MRAppMaster with kill -9
> *Expected:* The ResourceManager launches a new MRAppMaster container and MRAppAttempt, and the application finishes correctly.
> *Actual:* After launching the new MRAppMaster and MRAppAttempt, the application fails with the following exception:
> {noformat}
> 2016-12-22 23:17:53,929 ERROR [ContainerLauncher #9] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Container launch failed for container_1482408247195_0002_02_000011 : org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent for node1:43037
> 	at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:254)
> 	at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.<init>(ContainerManagementProtocolProxy.java:244)
> 	at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:129)
> 	at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:395)
> 	at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138)
> 	at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:361)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> {noformat}
> *Problem:*
> When the RMCommunicator sends a registerApplicationMaster request to the RM, the RM generates NMTokens for the new RMAppAttempt. Those new NMTokens are transmitted to the RMCommunicator in the RegisterApplicationMasterResponse (getNMTokensFromPreviousAttempts method), but we don't handle these tokens in the RMCommunicator.register method. The RM doesn't transmit these tokens again in later allocate responses, so we never have them in the NMTokenCache, and accordingly we get the "No NMToken sent for node" exception.
> I have found that this issue appears after the changes from https://github.com/apache/hadoop/commit/9b272ccae78918e7d756d84920a9322187d61eed
> I tried the same scenario without that commit, and the application completed successfully after MRAppMaster recovery.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-help@hadoop.apache.org
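The failure mode described in the *Problem* section can be illustrated with a small self-contained model (hypothetical class names; this is not the actual RMCommunicator or NMTokenCache code): tokens returned at registration must be stored in the AM-side cache, because the RM never resends them, and a later container launch on that node otherwise fails with exactly the exception from the stack trace above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the AM-side NMTokenCache.
class TokenCacheModel {
    private final Map<String, String> tokensByNode = new ConcurrentHashMap<>();

    void store(String nodeAddr, String token) {
        tokensByNode.put(nodeAddr, token);
    }

    // Models ContainerManagementProtocolProxy: launching a container on a node
    // requires a cached NMToken for that node.
    String tokenFor(String nodeAddr) {
        String t = tokensByNode.get(nodeAddr);
        if (t == null) {
            throw new IllegalStateException("No NMToken sent for " + nodeAddr);
        }
        return t;
    }
}

class RecoverySketch {
    public static void main(String[] args) {
        TokenCacheModel cache = new TokenCacheModel();
        // Tokens the RM returned once, at re-registration, via
        // getNMTokensFromPreviousAttempts (values are made up).
        Map<String, String> tokensFromPreviousAttempts =
                Map.of("node1:43037", "nmtoken-for-node1");

        // Without handling them in register(): the later launch fails.
        try {
            cache.tokenFor("node1:43037");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // prints: No NMToken sent for node1:43037
        }

        // With the tokens copied into the cache during register(): launch succeeds.
        tokensFromPreviousAttempts.forEach(cache::store);
        System.out.println(cache.tokenFor("node1:43037")); // prints: nmtoken-for-node1
    }
}
```

This only models the caching contract the comment describes; a real fix would live in RMCommunicator.register, as the *Problem* section suggests.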