hadoop-yarn-issues mailing list archives

From "Botong Huang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-7102) NM heartbeat stuck when responseId overflows MAX_INT
Date Thu, 14 Sep 2017 21:39:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-7102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167000#comment-16167000 ]

Botong Huang commented on YARN-7102:

After fighting through the unit tests, here is the status of the v6 patch:
* {{TestAMRMClientContainerRequest.testOpportunisticAndGuaranteedRequests}} is already failing
in trunk; YARN-7199 has been opened for it.
* {{TestContainerAllocation.testAMContainerAllocationWhenDNSUnavailable}} is being tracked under
* I need help with {{TestContainerManagerSecurity.testContainerManager}}: it seems to fail
consistently in Yetus, but I cannot reproduce it locally at all.

[~wangda] and [~jlowe], can you please take a look? Some quick notes in summary: 

With the stricter responseId check in the NM heartbeat, we need to drain the RM dispatcher events
after every {{MockNM}} heartbeat. Otherwise, two back-to-back {{MockNM}} heartbeats will fail
on the second heartbeat, because the RM is still processing the first heartbeat event.
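
To make the failure mode concrete, here is a minimal sketch of a strict responseId check with
wraparound. It is not the actual {{ResourceTrackerService}} code: the class {{ResponseIdSketch}},
the {{Verdict}} enum and both method names are hypothetical, and the assumption is that the id
cycles from 0 to MAX_INT and back to 0.

```java
/** Hypothetical sketch of a strict, overflow-safe responseId check (not the real RM code). */
final class ResponseIdSketch {

  /** Possible outcomes of the check; the names are illustrative only. */
  enum Verdict { PROCESS, RESEND_LAST_RESPONSE, OUT_OF_SYNC }

  /** Advances the id, wrapping from Integer.MAX_VALUE back to 0 instead of going negative. */
  static int nextResponseId(int responseId) {
    return (responseId + 1) & Integer.MAX_VALUE;
  }

  /**
   * requestId is the id carried by the incoming heartbeat; lastResponseId is the id of the
   * last response the RM has recorded for that node.
   */
  static Verdict check(int requestId, int lastResponseId) {
    if (requestId == lastResponseId) {
      return Verdict.PROCESS;               // the node saw the latest response
    }
    if (nextResponseId(requestId) == lastResponseId) {
      return Verdict.RESEND_LAST_RESPONSE;  // duplicate heartbeat, exactly one step behind
    }
    return Verdict.OUT_OF_SYNC;             // anything else is rejected by the strict check
  }

  public static void main(String[] args) {
    System.out.println(nextResponseId(Integer.MAX_VALUE));
    // A heartbeat that is one step ahead of what the RM has recorded so far:
    System.out.println(check(2, 1));
  }
}
```

Under these assumptions, a second heartbeat that arrives before the RM has recorded the first
response carries an id the RM does not expect yet, so the strict check rejects it.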

Instead of going through all the places where nm.nodeHeartbeat is called and adding rm.drainEvent
afterwards (some already have it, though), I changed the {{MockNM}} API to drain the RM events
inside the heartbeat method.
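
The idea can be sketched with a small, deterministic stand-in. This is not the real {{MockNM}} or
{{MockRM}}: {{FakeRm}}, {{FakeNm}} and the {{drainAfterHeartbeat}} flag are hypothetical, and the
asynchronous dispatcher is modelled as a plain queue that is only applied when drained.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical sketch of draining RM events inside the mock heartbeat (not the real test code). */
final class DrainingHeartbeatSketch {

  /** Stand-in for the RM: accepted heartbeats are queued and only applied when drained. */
  static final class FakeRm {
    private final Queue<Integer> pending = new ArrayDeque<>();
    private int lastResponseId = 0;

    /** Strict check: accept only a heartbeat that carries the last recorded response id. */
    boolean heartbeat(int requestId) {
      if (requestId != lastResponseId) {
        return false;
      }
      pending.add((requestId + 1) & Integer.MAX_VALUE); // recorded later, as an "event"
      return true;
    }

    /** Applies every queued event, like waiting for the dispatcher to go idle. */
    void drainEvents() {
      while (!pending.isEmpty()) {
        lastResponseId = pending.remove();
      }
    }
  }

  /** Stand-in for the mock NM; optionally drains the RM after each heartbeat. */
  static final class FakeNm {
    private final FakeRm rm;
    private final boolean drainAfterHeartbeat;
    private int responseId = 0;

    FakeNm(FakeRm rm, boolean drainAfterHeartbeat) {
      this.rm = rm;
      this.drainAfterHeartbeat = drainAfterHeartbeat;
    }

    boolean nodeHeartbeat() {
      boolean accepted = rm.heartbeat(responseId);
      if (accepted) {
        responseId = (responseId + 1) & Integer.MAX_VALUE;
      }
      if (drainAfterHeartbeat) {
        rm.drainEvents(); // callers no longer need to drain explicitly
      }
      return accepted;
    }
  }

  public static void main(String[] args) {
    FakeNm withoutDrain = new FakeNm(new FakeRm(), false);
    withoutDrain.nodeHeartbeat();
    System.out.println("second heartbeat, no drain: " + withoutDrain.nodeHeartbeat());

    FakeNm withDrain = new FakeNm(new FakeRm(), true);
    withDrain.nodeHeartbeat();
    System.out.println("second heartbeat, with drain: " + withDrain.nodeHeartbeat());
  }
}
```

Putting the drain inside the heartbeat keeps every existing call site correct without having to
touch each one individually.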

For easy review, the real changes are in these four files: {{ResourceTrackerService, MockNM,
TestResourceTrackerService, MiniYarnCluster}}, plus {{TestMiniYarnClusterNodeUtilization}} (where
I removed a test case because it is subsumed by/identical to the other one). All other file
changes are simply due to the API change in {{MockNM}}.

Thanks in advance!

> NM heartbeat stuck when responseId overflows MAX_INT
> ----------------------------------------------------
>                 Key: YARN-7102
>                 URL: https://issues.apache.org/jira/browse/YARN-7102
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Botong Huang
>            Assignee: Botong Huang
>            Priority: Critical
>         Attachments: YARN-7102.v1.patch, YARN-7102.v2.patch, YARN-7102.v3.patch, YARN-7102.v4.patch,
> YARN-7102.v5.patch, YARN-7102.v6.patch
> ResponseId overflow problem in NM-RM heartbeat. This is the same as the AM-RM heartbeat issue in YARN-6640;
> please refer to YARN-6640 for details.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org
