hadoop-yarn-issues mailing list archives

From "Sangjin Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4741) RM is flooded with RMNodeFinishedContainersPulledByAMEvents in the async dispatcher event queue
Date Mon, 29 Feb 2016 21:35:18 GMT

    [ https://issues.apache.org/jira/browse/YARN-4741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172650#comment-15172650 ]

Sangjin Lee commented on YARN-4741:
-----------------------------------

I attached the node manager log. It covers pretty much the entire log, from startup until
past the point where these events were occurring for this node in the RM. The only thing
I removed is a section early in the log that lists all the files being recovered by the
localization service.

Unfortunately I no longer have the RM log for this episode.

We do not have YARN-3990 or YARN-3896 applied. Although we should get them in any case, I'm
not sure if those are related to the issue we're seeing.

> RM is flooded with RMNodeFinishedContainersPulledByAMEvents in the async dispatcher event queue
> -----------------------------------------------------------------------------------------------
>
>                 Key: YARN-4741
>                 URL: https://issues.apache.org/jira/browse/YARN-4741
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Sangjin Lee
>            Priority: Critical
>         Attachments: nm.log
>
>
> We had a pretty major incident with the RM where it was continually flooded with RMNodeFinishedContainersPulledByAMEvents in the async dispatcher event queue.
> In our setup, we had the RM HA or stateful restart *disabled*, but NM work-preserving restart *enabled*. Due to other issues, we did a cluster-wide NM restart.
> Some time during the restart (which took multiple hours), we started seeing the async dispatcher event queue build up. Normally the queue size it logs is 1,000. In this case, it climbed all the way up to tens of millions of events.
> When we looked at the RM log, it was full of the following messages:
> {noformat}
> 2016-02-18 01:47:29,530 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid event FINISHED_CONTAINERS_PULLED_BY_AM on Node  worker-node-foo.bar.net:8041
> 2016-02-18 01:47:29,535 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle this event at current state
> 2016-02-18 01:47:29,535 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid event FINISHED_CONTAINERS_PULLED_BY_AM on Node  worker-node-foo.bar.net:8041
> 2016-02-18 01:47:29,538 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle this event at current state
> 2016-02-18 01:47:29,538 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid event FINISHED_CONTAINERS_PULLED_BY_AM on Node  worker-node-foo.bar.net:8041
> {noformat}
> The node in question had been restarted a few minutes earlier.
> When we inspected the RM heap, it was full of RMNodeFinishedContainersPulledByAMEvents.
> Suspecting the NM work-preserving restart, we disabled it and did another cluster-wide rolling restart. Initially that seemed to help reduce the queue size, but the queue built back up to several million events and that continued for an extended period. We had to restart the RM to resolve the problem.
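
To see which nodes account for the flood, the RMNodeImpl messages quoted above can be tallied per node. The following is a minimal sketch in Python, assuming each message occupies a single line in the actual RM log (the mail archive wraps them). The function name `count_invalid_events` and the regular expression are illustrative assumptions based only on the log excerpt above; they are not part of Hadoop.

```python
import re
from collections import Counter

# Pattern derived from the log excerpt in this message (an assumption, not a
# documented format): "Invalid event <EVENT_TYPE> on Node  <host:port>".
INVALID_EVENT = re.compile(r"Invalid event (\S+) on Node\s+(\S+)")


def count_invalid_events(lines):
    """Return a Counter keyed by (event type, node address)."""
    counts = Counter()
    for line in lines:
        match = INVALID_EVENT.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# Sample input copied from the excerpt above, one log entry per line.
PREFIX = "ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: "
SAMPLE = [
    "2016-02-18 01:47:29,530 " + PREFIX
    + "Invalid event FINISHED_CONTAINERS_PULLED_BY_AM on Node  worker-node-foo.bar.net:8041",
    "2016-02-18 01:47:29,535 " + PREFIX + "Can't handle this event at current state",
    "2016-02-18 01:47:29,535 " + PREFIX
    + "Invalid event FINISHED_CONTAINERS_PULLED_BY_AM on Node  worker-node-foo.bar.net:8041",
]

# Print the nodes with the most rejected events first.
for (event, node), n in count_invalid_events(SAMPLE).most_common():
    print(n, event, node)
```

For a real log, the same function accepts an open file object, since it only iterates over lines.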



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
