hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jian He (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart
Date Thu, 11 Sep 2014 01:01:34 GMT

    [ https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129459#comment-14129459
] 

Jian He commented on YARN-1372:
-------------------------------

Thanks for updating the patch. Some comments and naming suggestions:
- NodeStatusUpdater import changes only, we can revert
- indentation format of the second line.
{code}
  public void removeCompletedContainersFromContext(List<ContainerId>
                                                       containerIds) throws
  public RMNodeCleanedupContainerNotifiedEvent(NodeId nodeId,
                                               ContainerId contId) {
{code}
- why adding {{context.getContainers().remove(cid);}}  in removeVeryOldStoppedContainersFromContext
method? won’t this remove the containers from context immediately when we send the container
statuses across, which contradicts the rest of the changes?
- In NodeStatusUpdaterImpl, previousCompletedContainers cache is not needed any more, as we
make NM remove containers from context only after it gets the notification. We can remove
this; Instead, in NodeStatusUpdater#getContainerStatuses, while we are looping all the containers,
we can check whether the corresponding application exists, if Not, remove it from context.
- make sure {{context.getNMStateStore().removeContainer(cid);}} is called after receiving
the notification from RM as well.
- {{RMNodeEventType#CLEANEDUP_CONTAINER_NOTIFIED}}: put in a new section where source is RMAppAttempt.
how about rename to FINISHED_CONTAINERS_PULLED_BY_AM; similarly RMNodeCleanedupContainerNotifiedEvent
-> RMNodeFinishedContainersPulledByAMEvent
- In RMAppAttemptImpl#BaseFinalTransition, we can clear finishedContainersSentToAM, in case
that AM unexpectedly crashes.
- I think Map<NodeId, List<ContainerStatus >> is more space efficient than: 
{code}
  private Map<ContainerStatus, NodeId> finishedContainersSentToAM =
      new HashMap<ContainerStatus, NodeId>();
{code}
- :  format convention is to have method body in a different line from the method head.
{code}
public NodeId getNodeId() { return this.nodeId; }
{code}
- RMNodeImpl#cleanupContainersNotified, may be rename to finishedContainersPulledByAM. similarly
CleanedupContainerNotifiedTransition to FinishedContainersPulledByAMTransition.
- NodeHeartbeatResponse#addCleanedupContainersNotified, how about  addFinishedContainersPulledByAM;
similarly for the getter NodeHeartbeatResponse#getCleanedupContainersNotified and the proto
file. Also add some code comments to explain why adding this new API.

> Ensure all completed containers are reported to the AMs across RM restart
> -------------------------------------------------------------------------
>
>                 Key: YARN-1372
>                 URL: https://issues.apache.org/jira/browse/YARN-1372
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Bikas Saha
>            Assignee: Anubhav Dhoot
>         Attachments: YARN-1372.001.patch, YARN-1372.001.patch, YARN-1372.002_NMHandlesCompletedApp.patch,
YARN-1372.002_RMHandlesCompletedApp.patch, YARN-1372.002_RMHandlesCompletedApp.patch, YARN-1372.003.patch,
YARN-1372.prelim.patch, YARN-1372.prelim2.patch
>
>
> Currently the NM informs the RM about completed containers and then removes those containers
from the RM notification list. The RM passes on that completed container information to the
AM and the AM pulls this data. If the RM dies before the AM pulls this data then the AM may
not be able to get this information again. To fix this, NM should maintain a separate list
of such completed container notifications sent to the RM. After the AM has pulled the containers
from the RM then the RM will inform the NM about it and the NM can remove the completed container
from the new list. Upon re-register with the RM (after RM restart) the NM should send the
entire list of completed containers to the RM along with any other containers that completed
while the RM was dead. This ensures that the RM can inform the AM's about all completed containers.
Some container completions may be reported more than once since the AM may have pulled the
container but the RM may die before notifying the NM about the pull.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message