hadoop-yarn-issues mailing list archives

From "Jian He (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-5620) Core changes in NodeManager to support for upgrade and rollback of Containers
Date Fri, 09 Sep 2016 16:20:22 GMT

    [ https://issues.apache.org/jira/browse/YARN-5620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15477474#comment-15477474 ]

Jian He commented on YARN-5620:

Thanks Arun, some more comments and questions:

- The reInitContext is updated asynchronously via an event, but here it is checked synchronously
in the upgrade API:
if (!container.isRunning() || container.isReInitializing()) {
- Also, if the request is ignored here, it appears to the AM that the upgrade call was silently
dropped, and the user who issued the upgrade command will be confused that the call had no effect:
if (container.reInitContext != null) {
  container.addDiagnostics("Container [" + container.getContainerId()
      + "] ReInitialization already in progress !!");

Overall, I think we can move the logic of ReInitializeContainerTransition to the upgrade API.
All it does is resend the events to containerLauncher or ResourceLocalizationService,
which can be done in the API. This also has the benefit of rejecting the upgrade call while
a previous upgrade is in progress.
The current solution has a race condition: if a previous upgrade is in progress, the second
one may be ignored instead of rejected, and the user will not be notified that the previous
upgrade is still in progress.
Another potential race is that the relocalize API may go through instead of being rejected
while an upgrade is under way. Because the reInitContext is updated asynchronously, the requested
resources will then be treated as upgrade resources.
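
To illustrate the point about rejecting instead of ignoring, here is a minimal sketch of a synchronous check-and-set in the API path. It is not the YARN-5620 patch: UpgradeGate, ReInitContext (as defined here), tryBeginUpgrade and finishUpgrade are hypothetical names, and the only assumption is that the in-progress marker is set atomically by the API call itself rather than later by an event handler.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for a container; not the actual ContainerImpl. */
public class UpgradeGate {

    /** Hypothetical placeholder for whatever the re-init context carries. */
    public static final class ReInitContext {
        public final String label;

        public ReInitContext(String label) {
            this.label = label;
        }
    }

    // Null means "no upgrade in progress".
    private final AtomicReference<ReInitContext> reInitContext =
        new AtomicReference<>();

    /**
     * Claims the upgrade slot synchronously. Returns false when another
     * upgrade already holds it, so the caller can reject the request
     * explicitly instead of dropping it later in an event handler.
     */
    public boolean tryBeginUpgrade(ReInitContext next) {
        return reInitContext.compareAndSet(null, next);
    }

    /** Releases the slot once the upgrade has completed or failed. */
    public void finishUpgrade() {
        reInitContext.set(null);
    }

    public boolean isReInitializing() {
        return reInitContext.get() != null;
    }

    public static void main(String[] args) {
        UpgradeGate container = new UpgradeGate();
        boolean first = container.tryBeginUpgrade(new ReInitContext("v2"));
        boolean second = container.tryBeginUpgrade(new ReInitContext("v3"));
        // The second call is rejected because the first is still in progress.
        System.out.println(first + " " + second); // prints "true false"
    }
}
```

With a gate like this, a relocalize request could consult isReInitializing() in the same synchronous path and be rejected as well, since the marker is already visible when the upgrade call returns.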

- Why is the checkAndUpdatePending method needed?
checkAndUpdatePending(rsrcEvent, container.resourceSet, links);
if (container.isReInitializing()) {
      rsrcEvent, container.reInitContext.resourceSet, links); 

- Why do we set the reInitContext to null once resource localization has failed?
if (container.isReInitializing() &&
        .containsKey(failedEvent.getResource())) {
  LOG.error("Container [" + container.getContainerId() + "] Re-init" +
      " failed !! Resource [" + failedEvent.getResource() + "] could" +
      " not be localized !!");
  container.reInitContext = null; 

- In ResourceLocalizedWhileRunningTransition, the symlink creation part is not needed for
reinit, because it will be done as part of the containerLaunch.
- Given so many if(reinitializing) conditions in ContainerImpl, should we consider adding
a new state?
- When launching the container, we need to cleanupPreviousContainerFiles as done in ContainerRelaunch,
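
As a rough illustration of the "new state" suggestion above, the sketch below replaces scattered if(reinitializing) checks with one explicit state that decides which requests are accepted. Everything in it is hypothetical: ContainerStateSketch, its State and Request enums, and accepts() are invented for this example and are not the ContainerImpl state machine.

```java
/** Hypothetical sketch; not the actual ContainerImpl state machine. */
public class ContainerStateSketch {

    /** Assumed states, with re-initialization modeled explicitly. */
    public enum State { RUNNING, REINITIALIZING, DONE }

    /** Assumed request types arriving through the API. */
    public enum Request { UPGRADE, RELOCALIZE }

    /**
     * Decides acceptance from the state alone, so each handler does not
     * need its own re-initialization check.
     */
    public static boolean accepts(State state, Request request) {
        switch (state) {
            case RUNNING:
                // A running container may be upgraded or relocalized.
                return true;
            case REINITIALIZING:
                // Both request types are rejected until the upgrade ends.
                return false;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(accepts(State.RUNNING, Request.UPGRADE));
        System.out.println(accepts(State.REINITIALIZING, Request.RELOCALIZE));
    }
}
```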

> Core changes in NodeManager to support for upgrade and rollback of Containers
> -----------------------------------------------------------------------------
>                 Key: YARN-5620
>                 URL: https://issues.apache.org/jira/browse/YARN-5620
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Arun Suresh
>            Assignee: Arun Suresh
>         Attachments: YARN-5620.001.patch, YARN-5620.002.patch, YARN-5620.003.patch, YARN-5620.004.patch, YARN-5620.005.patch, YARN-5620.006.patch, YARN-5620.007.patch
>
> JIRA proposes to modify the ContainerManager (and other core classes) to support upgrade
> of a running container with a new {{ContainerLaunchContext}}, as well as the ability to roll back
> the upgrade if the container is not able to restart using the new launch context.

