Date: Thu, 30 Aug 2012 10:51:08 +1100 (NCT)
From: "Vinod Kumar Vavilapalli (JIRA)"
To: yarn-issues@hadoop.apache.org
Message-ID: <1342278414.14545.1346284268073.JavaMail.jiratomcat@arcas>
In-Reply-To: <218805250.12762.1346262369070.JavaMail.jiratomcat@arcas>
Subject: [jira] [Updated] (YARN-60) NMs rejects all container tokens after secret key rolls

     [ https://issues.apache.org/jira/browse/YARN-60?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated YARN-60:
----------------------------------------

    Attachment: YARN-60-20120829.txt

The bug is in {{ResourceTrackerService}} which was storing _lastKnownMasterKey_ for a node in {{RMNode}} incorrectly. After each update, it was memorizing _nextMasterKey_ which will be reset to null immediately on roll-over.
Because of this, NMs get NPEs, don't update their master-keys and hence reject all the containers.

In the attached patch
 - I removed storing _currentKey_ in RMNode altogether. Now the NM always reports its _lastKnownMasterKey_ and the RM checks if any updates need to be sent.
 - Added {{TestRMNMSecretKeys}} to make sure that the RM sends correct key-updates to the NM. This was the test that caught the issue for me; it fails without the code changes and passes after the fix.
 - I also earlier did a bogus fix (without really knowing though :) ) for {{TestContainerManagerSecurity}} over at YARN-39. Corrected that too.

We also need to modify {{NodeStatusUpdaterImpl}} to react properly to exceptions from the RM. Will do so separately.

> NMs rejects all container tokens after secret key rolls
> -------------------------------------------------------
>
>                 Key: YARN-60
>                 URL: https://issues.apache.org/jira/browse/YARN-60
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.2.0-alpha, 0.23.3
>            Reporter: Daryn Sharp
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>         Attachments: YARN-60-20120829.txt
>
>
> The NM's token secret manager will reject all container tokens after the secret key is activated which means the NM will not launch _any_ containers including AMs. The whole yarn cluster becomes inoperable in 1d.
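
Editorial illustration (not part of the attached patch): a minimal Java sketch of the heartbeat check described in the comment above, where the NM reports the key it last saw and the RM decides whether to send an update. The class {{KeyRollSketch}}, its nested {{MasterKey}}, and the methods {{rollKey}}, {{activateNextKey}} and {{keyUpdateFor}} are hypothetical names chosen for this sketch; they are not the real YARN classes or signatures. The point is only that the RM no longer has to remember a per-node key reference that can become null at roll-over.

```java
import java.util.Objects;

/** Hypothetical stand-in for the RM-side key handling; NOT the real YARN code. */
final class KeyRollSketch {

    /** A master key reduced to its id, for illustration only. */
    static final class MasterKey {
        final int keyId;
        MasterKey(int keyId) { this.keyId = keyId; }
    }

    private MasterKey currentKey = new MasterKey(1);
    private MasterKey nextKey;   // non-null only between roll and activation
    private int nextId = 2;

    /** Generate the next key; the current key stays active for now. */
    void rollKey() {
        nextKey = new MasterKey(nextId++);
    }

    /** Promote the next key to current; nextKey is reset to null here. */
    void activateNextKey() {
        currentKey = Objects.requireNonNull(nextKey, "no key was rolled");
        nextKey = null;
    }

    /**
     * Heartbeat check: the NM reports the id of the key it last saw.
     * Returns the key the NM should receive, or null if it is up to date.
     * Nothing is stored per node, so there is no stale reference to null out.
     */
    MasterKey keyUpdateFor(int lastKnownKeyId) {
        MasterKey newest = (nextKey != null) ? nextKey : currentKey;
        return newest.keyId == lastKnownKeyId ? null : newest;
    }

    public static void main(String[] args) {
        KeyRollSketch rm = new KeyRollSketch();
        int nmLastKnown = 1;

        rm.rollKey();
        MasterKey update = rm.keyUpdateFor(nmLastKnown); // NM learns the rolled key
        nmLastKnown = update.keyId;

        rm.activateNextKey();
        // After activation the NM already holds the active key, so no update is due.
        System.out.println(rm.keyUpdateFor(nmLastKnown) == null);
    }
}
```

Because the comparison uses only the id reported by the NM and the RM's own current/next keys, an NM that missed a heartbeat around roll-over is still told about the newest key on its next report.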