Date: Sat, 27 Feb 2016 02:58:18 +0000 (UTC)
From: "Rohith Sharma K S (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-4741) RM is flooded with RMNodeFinishedContainersPulledByAMEvents in the async dispatcher event queue

    [ https://issues.apache.org/jira/browse/YARN-4741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15170307#comment-15170307 ]

Rohith Sharma K S commented on YARN-4741:
-----------------------------------------

I am not sure whether this is the same as YARN-3990. Based on the affected version, I suspect it might be the same issue. On the other hand, judging by the event type, it may also be a new issue.
Anyway, [~sjlee0], can you verify whether the fix for YARN-3990 is present in your cluster?

> RM is flooded with RMNodeFinishedContainersPulledByAMEvents in the async dispatcher event queue
> -----------------------------------------------------------------------------------------------
>
>                 Key: YARN-4741
>                 URL: https://issues.apache.org/jira/browse/YARN-4741
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Sangjin Lee
>            Priority: Critical
>
> We had a pretty major incident with the RM where it was continually flooded with RMNodeFinishedContainersPulledByAMEvents in the async dispatcher event queue.
> In our setup, we had the RM HA or stateful restart *disabled*, but NM work-preserving restart *enabled*. Due to other issues, we did a cluster-wide NM restart.
> Some time during the restart (which took multiple hours), we started seeing the async dispatcher event queue building. Normally it would log 1,000. In this case, it climbed all the way up to tens of millions of events.
> When we looked at the RM log, it was full of the following messages:
> {noformat}
> 2016-02-18 01:47:29,530 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid event FINISHED_CONTAINERS_PULLED_BY_AM on Node worker-node-foo.bar.net:8041
> 2016-02-18 01:47:29,535 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle this event at current state
> 2016-02-18 01:47:29,535 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid event FINISHED_CONTAINERS_PULLED_BY_AM on Node worker-node-foo.bar.net:8041
> 2016-02-18 01:47:29,538 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle this event at current state
> 2016-02-18 01:47:29,538 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid event FINISHED_CONTAINERS_PULLED_BY_AM on Node worker-node-foo.bar.net:8041
> {noformat}
> And that node in question was restarted a few minutes earlier.
> When we inspected the RM heap, it was full of RMNodeFinishedContainersPulledByAMEvents.
> Suspecting the NM work-preserving restart, we disabled it and did another cluster-wide rolling restart. Initially that seemed to have helped reduce the queue size, but the queue built back up to several millions and continued for an extended period. We had to restart the RM to resolve the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)