hadoop-yarn-issues mailing list archives

From "Jun Gong (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2047) RM should honor NM heartbeat expiry after RM restart
Date Mon, 09 Nov 2015 13:41:11 GMT

    [ https://issues.apache.org/jira/browse/YARN-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14996539#comment-14996539 ]

Jun Gong commented on YARN-2047:
--------------------------------

Another thought: the RM could rebuild containers' information from the AMs.

When an AM re-registers with the RM, it reports its running containers. The RM records
them in a HashSet, *amRunningContainers*, looks each one up by calling *getRMContainer(containerId)*,
and removes it from *amRunningContainers* if the RMContainer already exists. When an NM re-registers
with the RM, the RM removes every container that the NM reports from *amRunningContainers*. After
the NM expiry time has passed, the RM iterates over *amRunningContainers* and tells the corresponding
AMs that those containers have finished.
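
The bookkeeping described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not actual ResourceManager code: the class *AmReportedContainerTracker* and its method names are hypothetical, container IDs are plain strings, and a second set stands in for the lookup that *getRMContainer(containerId)* would perform. Notifying the AMs is left out; the sketch only returns the containers that would be reported as finished.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of the proposal; not an actual ResourceManager class. */
class AmReportedContainerTracker {
    // Containers that AMs reported as running but that no NM has confirmed yet.
    private final Set<String> amRunningContainers = new HashSet<>();
    // Assumption: stand-in for the RM state that getRMContainer(containerId) consults.
    private final Set<String> rmContainers = new HashSet<>();

    /** An AM re-registers and reports the containers it believes are running. */
    void onAmReRegister(Collection<String> reported) {
        for (String id : reported) {
            // Same effect as "record, query getRMContainer, delete if it exists":
            // containers the RM already knows about are not tracked.
            if (!rmContainers.contains(id)) {
                amRunningContainers.add(id);
            }
        }
    }

    /** An NM re-registers and reports the containers it is actually running. */
    void onNmReRegister(Collection<String> reported) {
        rmContainers.addAll(reported);
        amRunningContainers.removeAll(reported);
    }

    /** Called once the NM expiry time has passed; returns the containers to mark finished. */
    Set<String> onNmExpiry() {
        Set<String> lost = new TreeSet<>(amRunningContainers);
        amRunningContainers.clear();
        return lost;
    }

    public static void main(String[] args) {
        AmReportedContainerTracker rm = new AmReportedContainerTracker();
        rm.onAmReRegister(Arrays.asList("container_1", "container_2"));
        rm.onNmReRegister(Arrays.asList("container_1"));
        // container_2 was never confirmed by any NM, so it is returned as finished.
        System.out.println(rm.onNmExpiry());
    }
}
```

The sketch also covers the case where an NM re-registers before the AM does: the NM's containers are already known to the RM by then, so the AM's report of them is never added to *amRunningContainers*.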

The result seems to be the same as what this issue aims for. However, it requires adding to or modifying the AM's register RPC.

> RM should honor NM heartbeat expiry after RM restart
> ----------------------------------------------------
>
>                 Key: YARN-2047
>                 URL: https://issues.apache.org/jira/browse/YARN-2047
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Bikas Saha
>
> After the RM restarts, it forgets about existing NM's (and their potentially decommissioned
> status too). After restart, the RM cannot maintain the contract to the AM's that a lost NM's
> containers will be marked finished within the expiry time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
