hadoop-yarn-issues mailing list archives

From "Rohith (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3194) After NM restart,completed containers are not released by RM which are sent during NM registration
Date Wed, 18 Feb 2015 09:02:13 GMT

    [ https://issues.apache.org/jira/browse/YARN-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325610#comment-14325610 ]

Rohith commented on YARN-3194:

Thanks [~jlowe] [~djp] [~jianhe] for the detailed review :-)

bq. the container status processing code is almost a duplicate of the same code in StatusUpdateWhenHealthyTransition
Agreed, this has to be refactored. Most of the containerStatus processing code is the same.

bq. we don't remove containers that have completed from the launchedContainers map which seems
I see, yes. Completed containers should be removed from launchedContainers.
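
A minimal sketch of that bookkeeping, assuming a simplified stand-in for the RM-side node state. The class LaunchedContainerTracker, its State enum, and the onStatus method are hypothetical names for illustration only, not the actual RMNode code. The idea is just that a COMPLETE status removes the entry from the launched set instead of leaving it behind:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified model; not the real RMNode implementation.
final class LaunchedContainerTracker {
    enum State { RUNNING, COMPLETE }

    private final Set<String> launchedContainers = new HashSet<>();

    /** Returns true if this status completed a container that was tracked as launched. */
    boolean onStatus(String containerId, State state) {
        if (state == State.RUNNING) {
            launchedContainers.add(containerId);
            return false;
        }
        // COMPLETE: drop the entry so it does not linger in the launched set.
        return launchedContainers.remove(containerId);
    }

    boolean isLaunched(String containerId) {
        return launchedContainers.contains(containerId);
    }

    public static void main(String[] args) {
        LaunchedContainerTracker tracker = new LaunchedContainerTracker();
        tracker.onStatus("container_1", State.RUNNING);
        tracker.onStatus("container_1", State.COMPLETE);
        System.out.println(tracker.isLaunched("container_1")); // prints false
    }
}
```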

bq. I don't see why we would process container status sent during a reconnect differently
than a regular status update from the NM
IIUC, it is only to deal with NMContainerStatus versus containerStatus. But I am not sure why
the two were created as separate types. From what I see, containerStatus is a subset of
NMContainerStatus. I think containerStatus should have been inside NMContainerStatus.

bq. Is below condition valid for the newly added code in ReconnectNodeTransition too ? 
Yes, it is applicable since we are keeping the old RMNode object.

bq. Add timeout to the test, testAppCleanupWhenNMRstarts -> testProcessingContainerStatusesOnNMRestart
? and add more detailed comments about what the test is doing too ? 

bq. Could you add a validation that ApplicationMasterService#allocate indeed receives the
completed container in this scenario?
Agreed, I will add that.

bq. Question: does the 3072 include 1024 for the AM container and 2048 for the allocated container
AM memory is 1024 and the additionally requested container memory is 2048. In the test, the number
of requested containers is 1, so AllocatedMB should be AM + requested, i.e. 1024 + 2048 = 3072.
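
The same arithmetic as a small sketch, in case it helps when reading the test assertion. The class and method names here are hypothetical and only illustrate the sum; they are not taken from the patch:

```java
// Hypothetical helper illustrating the expected AllocatedMB value in the test.
final class AllocatedMbExample {
    /** AM memory plus the memory of every additionally requested container. */
    static int expectedAllocatedMb(int amMemoryMb, int containerMemoryMb, int containerCount) {
        return amMemoryMb + containerMemoryMb * containerCount;
    }

    public static void main(String[] args) {
        // One AM container of 1024 MB plus one requested container of 2048 MB.
        System.out.println(expectedAllocatedMb(1024, 2048, 1)); // prints 3072
    }
}
```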

> After NM restart,completed containers are not released by RM which are sent during NM registration
> --------------------------------------------------------------------------------------------------
>                 Key: YARN-3194
>                 URL: https://issues.apache.org/jira/browse/YARN-3194
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.7.0
>         Environment: NM restart is enabled
>            Reporter: Rohith
>            Assignee: Rohith
>            Priority: Blocker
>         Attachments: 0001-yarn-3194-v1.patch
> On NM restart, the NM sends all the outstanding NMContainerStatus to the RM, but the RM processes only
ContainerState.RUNNING. If a container completed while the NM was down, that container's resources
won't be released, which causes applications to hang.
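
A rough illustration of the gap described above, assuming a simplified model in which the statuses reported at registration are a map from container ID to state. RegistrationStatusSketch and containersToRelease are hypothetical names, not code from the RM or from the attached patch. The sketch only shows that COMPLETE entries need to be picked up at registration so their resources can be released:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the statuses an NM reports when it re-registers.
final class RegistrationStatusSketch {
    enum State { RUNNING, COMPLETE }

    /** Returns the IDs of reported containers whose resources should be released. */
    static List<String> containersToRelease(Map<String, State> reported) {
        List<String> completed = new ArrayList<>();
        for (Map.Entry<String, State> entry : reported.entrySet()) {
            // Handling only RUNNING here is what would leave completed containers allocated.
            if (entry.getValue() == State.COMPLETE) {
                completed.add(entry.getKey());
            }
        }
        return completed;
    }

    public static void main(String[] args) {
        Map<String, State> reported = new LinkedHashMap<>();
        reported.put("container_1", State.RUNNING);
        reported.put("container_2", State.COMPLETE);
        System.out.println(containersToRelease(reported)); // prints [container_2]
    }
}
```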
