hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jian He (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart
Date Fri, 19 Sep 2014 06:12:34 GMT

    [ https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140052#comment-14140052
] 

Jian He commented on YARN-1372:
-------------------------------

Thanks for updating! looks good overall, some comments:
- we may add the AM container into the justFinishedContainers as well in some transitions,
so that the AM container instance could be removed on NM.
- this method exceed 80 limit. the method doesn’t need to add containerStatus as a parameter,
as the containerStatus can be got from containerFinishedEvent.
{code}
private static void addJustFinishedContainer(RMAppAttemptImpl appAttem…..
{code}
- MockRM,  import changes only, we can revert.
- transferStateFromPreviousAttempt is currently invoked by RMApp only if work-preserving restart
is enabled. As per discussion, we are going to transfer them irrespectively. 
- For the following code,
{code}
if (recentlyStoppedContainers.get(cid) < currentTime || !context.getContainers().containsKey(cid))
{
   i.remove();
   context.getContainers().remove(cid);
   context.getNMStateStore().removeContainer(cid);
}
{code}
should the logic be like this ? So that the state-store can preserve the containers until
it the container is removed from context.
{code}
if (recentlyStoppedContainers.get(cid) < currentTime) {
    i.remove();
    if (!context.getContainers().containsKey(cid)) {
        context.getNMStateStore().removeContainer(cid);
    }
}
{code}
- Could you merge the following two transitions up to where the previously corresponding (DECOMMISSIONED/LOST)
state transitions are, maybe add the same transition for REBOOTED state as well ? (Also, there’s
a typo in the first transition state, it’s supposed to be DECOMMISSIONED instead of UNHEALTHY)
{code}
//Transitions from DECOMMISSIONED state
     .addTransition(NodeState.UNHEALTHY, NodeState.UNHEALTHY,
         RMNodeEventType.FINISHED_CONTAINERS_PULLED_BY_AM,
         new FinishedContainersPulledByAMTransition())
     //Transitions from LOST state
     .addTransition(NodeState.LOST, NodeState.LOST,
         RMNodeEventType.FINISHED_CONTAINERS_PULLED_BY_AM,
         new FinishedContainersPulledByAMTransition())
{code}

> Ensure all completed containers are reported to the AMs across RM restart
> -------------------------------------------------------------------------
>
>                 Key: YARN-1372
>                 URL: https://issues.apache.org/jira/browse/YARN-1372
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Bikas Saha
>            Assignee: Anubhav Dhoot
>         Attachments: YARN-1372.001.patch, YARN-1372.001.patch, YARN-1372.002_NMHandlesCompletedApp.patch,
YARN-1372.002_RMHandlesCompletedApp.patch, YARN-1372.002_RMHandlesCompletedApp.patch, YARN-1372.003.patch,
YARN-1372.004.patch, YARN-1372.005.patch, YARN-1372.005.patch, YARN-1372.006.patch, YARN-1372.prelim.patch,
YARN-1372.prelim2.patch
>
>
> Currently the NM informs the RM about completed containers and then removes those containers
from the RM notification list. The RM passes on that completed container information to the
AM and the AM pulls this data. If the RM dies before the AM pulls this data then the AM may
not be able to get this information again. To fix this, NM should maintain a separate list
of such completed container notifications sent to the RM. After the AM has pulled the containers
from the RM then the RM will inform the NM about it and the NM can remove the completed container
from the new list. Upon re-register with the RM (after RM restart) the NM should send the
entire list of completed containers to the RM along with any other containers that completed
while the RM was dead. This ensures that the RM can inform the AM's about all completed containers.
Some container completions may be reported more than once since the AM may have pulled the
container but the RM may die before notifying the NM about the pull.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message