hadoop-yarn-issues mailing list archives

From "Jian He (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-5620) Core changes in NodeManager to support for upgrade and rollback of Containers
Date Fri, 09 Sep 2016 16:20:22 GMT

    [ https://issues.apache.org/jira/browse/YARN-5620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15477474#comment-15477474 ]

Jian He commented on YARN-5620:
-------------------------------

Thanks Arun, some more comments and questions:

- The reInitContext is updated asynchronously via an event, but here it is checked synchronously
in the upgrade API, so the check can see a stale value.
{code}
if (!container.isRunning() || container.isReInitializing()) {
{code}
- Also, if the request is dropped here, it appears to the AM that the upgrade call was silently
ignored, and the user who issued the upgrade command will be confused because nothing reports why.
{code}
if (container.reInitContext != null) {
  container.addDiagnostics("Container [" + container.getContainerId()
      + "] ReInitialization already in progress !!");
  return; 
}

{code}
Overall, I think we can move the logic of ReInitializeContainerTransition into the upgrade API.
All it does is resend the events to the containerLauncher or ResourceLocalizationService,
which can be done in the API. This also has the benefit of rejecting the upgrade call while
a previous upgrade is in progress.
The current solution has a race condition: if a previous upgrade is in progress, the second
one may be ignored instead of rejected, and the user will not be notified that the previous
upgrade is still in progress.
Another potential race is that a relocalize API call made during an upgrade may go through
instead of being rejected, because the reInitContext is updated asynchronously. The requested
resources will then be treated as upgrade resources.
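
To illustrate the kind of synchronous check-and-reject I have in mind, here is a minimal standalone sketch. ReInitGate and its methods are hypothetical names used only for illustration and are not part of the patch or of the NodeManager; the String stands in for the new launch context.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for the per-container re-init bookkeeping;
// this is NOT the ContainerImpl from the patch.
final class ReInitGate {
    // null means no upgrade is in progress.
    private final AtomicReference<String> reInitContext = new AtomicReference<>();

    // Claims the upgrade slot in the caller's thread. Returns false if
    // another upgrade already holds it, so the API can reject the call
    // right away instead of dropping it later in a transition.
    boolean tryBeginUpgrade(String newLaunchContext) {
        return reInitContext.compareAndSet(null, newLaunchContext);
    }

    // Relocalization is allowed only while no upgrade holds the slot.
    boolean canRelocalize() {
        return reInitContext.get() == null;
    }

    // Called when the upgrade completes or is rolled back.
    void endUpgrade() {
        reInitContext.set(null);
    }
}

final class ReInitGateDemo {
    public static void main(String[] args) {
        ReInitGate gate = new ReInitGate();
        System.out.println(gate.tryBeginUpgrade("v2")); // first upgrade is accepted
        System.out.println(gate.tryBeginUpgrade("v3")); // second upgrade is rejected
        System.out.println(gate.canRelocalize());       // relocalize is rejected too
        gate.endUpgrade();
        System.out.println(gate.tryBeginUpgrade("v3")); // accepted after the first ends
    }
}
```

Because the check and the update happen in one atomic step in the API thread, a second upgrade or a concurrent relocalize can be answered with an explicit rejection.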

- Why is the checkAndUpdatePending method needed?
{code}
checkAndUpdatePending(rsrcEvent, container.resourceSet, links);
if (container.isReInitializing()) {
  checkAndUpdatePending(
      rsrcEvent, container.reInitContext.resourceSet, links); 
}

{code}
- Why do we set reInitContext to null once a resource localization has failed?
{code}
if (container.isReInitializing() &&
    container.reInitContext.resourceSet.getPendingResources()
        .containsKey(failedEvent.getResource())) {
  LOG.error("Container [" + container.getContainerId() + "] Re-init" +
      " failed !! Resource [" + failedEvent.getResource() + "] could" +
      " not be localized !!");
  container.reInitContext = null; 
}

{code}
- In ResourceLocalizedWhileRunningTransition, the symlink creation part is not needed for
reinit, because it will be done as part of the containerLaunch.
- Given so many if(reinitializing) conditions in ContainerImpl, should we consider adding
a new state?
- When launching the container, we need to call cleanupPreviousContainerFiles as is done in
ContainerRelaunch, right?
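
On the new-state question, a rough standalone sketch of what an explicit state could look like, instead of if(reinitializing) checks spread across transitions. All names here (SketchContainer, its State and Event enums) are hypothetical and heavily simplified; the real ContainerImpl state machine has many more states and events, and a successful re-init would go through a relaunch rather than straight back to RUNNING.

```java
// Hypothetical, simplified model; not the NodeManager's ContainerImpl.
final class SketchContainer {
    enum State { RUNNING, REINITIALIZING }
    enum Event { UPGRADE, RELOCALIZE, RESOURCES_LOCALIZED, RESOURCE_FAILED }

    private State state = State.RUNNING;

    synchronized State getState() {
        return state;
    }

    // Returns true if the event is accepted in the current state,
    // false if it is rejected.
    synchronized boolean handle(Event event) {
        switch (state) {
            case RUNNING:
                if (event == Event.UPGRADE) {
                    state = State.REINITIALIZING;
                    return true;
                }
                // Plain relocalization is only valid while RUNNING.
                return event == Event.RELOCALIZE;
            case REINITIALIZING:
                if (event == Event.RESOURCES_LOCALIZED
                        || event == Event.RESOURCE_FAILED) {
                    // Simplification: both outcomes return to RUNNING
                    // (new context on success, old context on failure).
                    state = State.RUNNING;
                    return true;
                }
                // A second UPGRADE or a RELOCALIZE is rejected here.
                return false;
            default:
                return false;
        }
    }
}
```

With a dedicated state, which events are legal during re-init is decided in one place by the transition table.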

> Core changes in NodeManager to support for upgrade and rollback of Containers
> -----------------------------------------------------------------------------
>
>                 Key: YARN-5620
>                 URL: https://issues.apache.org/jira/browse/YARN-5620
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Arun Suresh
>            Assignee: Arun Suresh
>         Attachments: YARN-5620.001.patch, YARN-5620.002.patch, YARN-5620.003.patch, YARN-5620.004.patch,
YARN-5620.005.patch, YARN-5620.006.patch, YARN-5620.007.patch
>
>
> JIRA proposes to modify the ContainerManager (and other core classes) to support upgrade
of a running container with a new {{ContainerLaunchContext}} as well as the ability to rollback
the upgrade if the container is not able to restart using the new launch Context. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
