Date: Wed, 6 Sep 2017 15:15:00 +0000 (UTC)
From: "Rohith Sharma K S (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Updated] (YARN-7163) RM crashes with OOM in secured cluster when HA is enabled

     [ https://issues.apache.org/jira/browse/YARN-7163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohith Sharma K S updated YARN-7163:
------------------------------------
    Attachment: YARN-7163.01.patch

Updated the patch to keep a reference to the RM in RMDelegationTokenSecretManager rather than to the RMContext. This way it always points to the new RMContext created during the transition to standby, so the old RMContext can be garbage collected.

> RM crashes with OOM in secured cluster when HA is enabled
> ---------------------------------------------------------
>
>                 Key: YARN-7163
>                 URL: https://issues.apache.org/jira/browse/YARN-7163
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: Rohith Sharma K S
>            Assignee: Rohith Sharma K S
>         Attachments: YARN-7163.01.patch
>
>
> It is observed that the RM crashes with a heap space OOM in a secure cluster (HTTP authentication is Kerberos) when RM HA is enabled.
> The scenario is:
> 1. Start the RM in HA secure mode. Let's say RM1 is in active mode.
> 2. Run many applications so that they use more than 50% of the configured heap space.
>    For example, if the heap space is 2 GB, run applications that occupy 1.5 GB of heap space.
> 3. Switch the RM to standby and bring it back to active. While recovering applications from the state store, the RM crashes with OOM.
> *Note*: This issue happens only when the RM is started directly as ACTIVE (not switched from standby to active during JVM startup).
> The heap dump shows that RMAuthenticationFilter holds 60% of the heap space, and the other 40% is held by RMAppState during recovery from the state store. This exceeds the heap space, and the RM crashes with OOM.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
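
The patch comment in this message describes holding a reference to the RM instead of to the RMContext, so that the old context becomes unreachable after a transition. The following is a minimal sketch of that idea only, not the code from YARN-7163.01.patch. Every class and method name in it (Context, Manager, SecretManagerHoldingContext, SecretManagerHoldingManager, visibleGeneration) is a hypothetical stand-in and not a Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;

public class StaleContextSketch {

    // Hypothetical stand-in for RMContext: state that is rebuilt on each transition.
    static final class Context {
        final int generation;
        final List<String> apps = new ArrayList<>();

        Context(int generation) {
            this.generation = generation;
        }
    }

    // Hypothetical stand-in for the ResourceManager: it swaps in a fresh context
    // whenever it transitions to standby.
    static final class Manager {
        private Context context = new Context(0);

        Context getContext() {
            return context;
        }

        void transitionToStandby() {
            context = new Context(context.generation + 1);
        }
    }

    // Leaky variant: captures the context once, so the old context (and everything
    // it references) stays reachable for as long as this object is reachable.
    static final class SecretManagerHoldingContext {
        private final Context context;

        SecretManagerHoldingContext(Context context) {
            this.context = context;
        }

        int visibleGeneration() {
            return context.generation;
        }
    }

    // Variant matching the patch comment: keep the manager and look up the current
    // context on each use, so no reference to the old context is retained here.
    static final class SecretManagerHoldingManager {
        private final Manager manager;

        SecretManagerHoldingManager(Manager manager) {
            this.manager = manager;
        }

        int visibleGeneration() {
            return manager.getContext().generation;
        }
    }

    public static void main(String[] args) {
        Manager rm = new Manager();
        SecretManagerHoldingContext leaky = new SecretManagerHoldingContext(rm.getContext());
        SecretManagerHoldingManager fixed = new SecretManagerHoldingManager(rm);

        rm.transitionToStandby();

        System.out.println("leaky sees generation " + leaky.visibleGeneration());
        System.out.println("fixed sees generation " + fixed.visibleGeneration());
    }
}
```

In this sketch the first variant still reports the context it captured at construction time, which is the kind of retained reference that would keep an old context on the heap. The second variant follows the manager to whichever context is current.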