hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-7102) NM heartbeat stuck when responseId overflows MAX_INT
Date Wed, 01 Nov 2017 15:40:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-7102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16234235#comment-16234235 ]

Jason Lowe commented on YARN-7102:

bq. it is indeed a race condition between node heartbeat vs node remove and add. The correct
fix is for TestResourceTrackerService.testReconnect to create MockNM by calling MockRM.registerNode,
in which a RM drain is called before return.

I do not follow the logic here.  This looks like a race condition that could happen outside
the unit tests as well, so we need more than a unit test update to address it.  The problem
is that both heartbeat processing and node reconnect processing can modify the response ID.
One of them is processed synchronously and the other isn't, so heartbeats can race ahead
of the reconnect.  That needs to be fixed.

One way to address it is to move at least part of the reconnect logic into ResourceTrackerService
so that it is processed synchronously.  At a minimum, it seems we need to know which RMNodeImpl
we're going with so we can track the right response ID for the next heartbeat from the node.
That way, even if the heartbeat arrives before the asynchronous reconnect event reaches RMNodeImpl,
we have the proper response ID in place to handle the heartbeat correctly.
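
To make the idea concrete, here is a rough sketch of tracking the response ID synchronously
on the tracker side.  This is not the actual ResourceTrackerService code: the class name
ResponseIdTrackerSketch, its methods, the string node IDs and the -1 "out of sync" return
value are all hypothetical, and the wrap-to-zero step in nextResponseId is just one possible
way to keep the ID from overflowing.  Duplicate-heartbeat handling is left out for brevity.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch only; names and behavior are assumptions, not YARN's API.
final class ResponseIdTrackerSketch {
    // nodeId -> last response ID handed to that node, updated synchronously.
    private final Map<String, Integer> lastResponseId = new ConcurrentHashMap<>();

    // Assumed overflow handling: wrap from Integer.MAX_VALUE back to 0.
    static int nextResponseId(int last) {
        return (last + 1) & Integer.MAX_VALUE;
    }

    // Called synchronously on register/reconnect, before any asynchronous
    // reconnect event would be dispatched to the node state machine.
    void onRegisterOrReconnect(String nodeId) {
        lastResponseId.put(nodeId, 0);
    }

    // Called synchronously for each heartbeat. Returns the next response ID,
    // or -1 if the node is unknown or its heartbeat ID does not match.
    int onHeartbeat(String nodeId, int heartbeatResponseId) {
        int[] result = {-1};
        lastResponseId.computeIfPresent(nodeId, (id, last) -> {
            if (heartbeatResponseId != last) {
                return last; // out of sync: leave the tracked ID unchanged
            }
            int next = nextResponseId(last);
            result[0] = next;
            return next;
        });
        return result[0];
    }
}
```

Because onRegisterOrReconnect resets the tracked ID in the same synchronous path that serves
heartbeats, a heartbeat that arrives right after a reconnect is checked against the reset
ID rather than against whatever the old node instance last handed out.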

> NM heartbeat stuck when responseId overflows MAX_INT
> ----------------------------------------------------
>                 Key: YARN-7102
>                 URL: https://issues.apache.org/jira/browse/YARN-7102
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Botong Huang
>            Assignee: Botong Huang
>            Priority: Critical
>         Attachments: YARN-7102-branch-2.8.v10.patch, YARN-7102-branch-2.8.v11.patch,
YARN-7102-branch-2.8.v9.patch, YARN-7102-branch-2.v9.patch, YARN-7102-branch-2.v9.patch, YARN-7102-branch-2.v9.patch,
YARN-7102.v1.patch, YARN-7102.v12.patch, YARN-7102.v2.patch, YARN-7102.v3.patch, YARN-7102.v4.patch,
YARN-7102.v5.patch, YARN-7102.v6.patch, YARN-7102.v7.patch, YARN-7102.v8.patch, YARN-7102.v9.patch
> ResponseId overflow problem in NM-RM heartbeat. This is the same as the AM-RM heartbeat problem in YARN-6640;
please refer to YARN-6640 for details.
