hadoop-yarn-issues mailing list archives

From "Botong Huang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-8451) Multiple NM heartbeat thread created when a slow NM resync with RM
Date Thu, 28 Jun 2018 18:38:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-8451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16526655#comment-16526655 ]

Botong Huang commented on YARN-8451:

Good point, fixed in v2 patch!

> Multiple NM heartbeat thread created when a slow NM resync with RM
> ------------------------------------------------------------------
>                 Key: YARN-8451
>                 URL: https://issues.apache.org/jira/browse/YARN-8451
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Botong Huang
>            Assignee: Botong Huang
>            Priority: Major
>         Attachments: YARN-8451.v1.patch, YARN-8451.v2.patch
> During an NM resync with the RM (say the RM went through a master-slave switch), if the NM
> is running slowly, the existing heartbeat thread may put more than one RESYNC event into
> the NM dispatcher before any of them is processed. As a result, multiple new heartbeat
> threads are later created and start to heartbeat to the RM concurrently, each with its own
> responseId. If at some point one thread falls more than one step behind the others, the RM
> will send back a resync signal in that thread's heartbeat response, killing all containers
> on this NM.
> See the comments below for details on how this can happen.
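
The interleaving described above can be reproduced with a small toy model. The sketch below is not NodeManager or ResourceManager code: the class HeartbeatRaceSketch, the ToyRm type, and the firstResyncStep helper are hypothetical names made up for illustration, and the RM rule it encodes (accept a matching responseId, tolerate a sender that is exactly one step behind, otherwise ask for a resync) is an assumption taken from the description, not from the patch.

```java
// Toy model of the race in the description; all names here are hypothetical.
class HeartbeatRaceSketch {

    /** Simplified RM-side bookkeeping for a single node (an assumption, not YARN code). */
    static final class ToyRm {
        static final int RESYNC = -1;
        private int lastResponseId = 0;

        /** Returns the responseId the sender should use next, or RESYNC. */
        int heartbeat(int senderResponseId) {
            if (senderResponseId + 1 == lastResponseId) {
                // Sender is exactly one step behind: treat as a duplicate
                // heartbeat and hand back the previous response id.
                return lastResponseId;
            }
            if (senderResponseId != lastResponseId) {
                // Sender is more than one step behind: ask the node to resync.
                return RESYNC;
            }
            lastResponseId++;
            return lastResponseId;
        }
    }

    /**
     * Replays a schedule of heartbeats, where schedule[i] is the index of the
     * heartbeat thread that sends at step i. Each thread keeps its own
     * responseId, starting at 0. Returns the first step at which the RM
     * answers RESYNC, or -1 if it never does.
     */
    static int firstResyncStep(int threads, int[] schedule) {
        ToyRm rm = new ToyRm();
        int[] responseIds = new int[threads];
        for (int step = 0; step < schedule.length; step++) {
            int thread = schedule[step];
            int reply = rm.heartbeat(responseIds[thread]);
            if (reply == ToyRm.RESYNC) {
                return step;
            }
            responseIds[thread] = reply;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Two heartbeat threads; thread 1 falls two steps behind thread 0.
        System.out.println(firstResyncStep(2, new int[] {0, 1, 0, 0, 1}));
        // A single heartbeat thread never diverges from the RM.
        System.out.println(firstResyncStep(1, new int[] {0, 0, 0, 0, 0}));
    }
}
```

In this model the two-thread schedule triggers a resync at step 4, when the lagging thread sends responseId 1 while the RM is already at 3, whereas the single-thread schedule never does. This is why keeping exactly one heartbeat thread alive after a resync matters.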
