hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Arun Suresh (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-5620) Core changes in NodeManager to support for upgrade and rollback of Containers
Date Fri, 09 Sep 2016 18:48:20 GMT

    [ https://issues.apache.org/jira/browse/YARN-5620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15477938#comment-15477938
] 

Arun Suresh commented on YARN-5620:
-----------------------------------

[~jianhe], thanks for thoughtful comments.

bq. Also, if it’s ignored here, then it appears to AM that the upgrade call is somehow ignored
So in the {{ReInitializeContainerTransition}}, I do the following
{noformat}
if (container.reInitContext != null) {
   container.addDiagnostics("Container [" + container.getContainerId()
      + "] ReInitialization already in progress !!");
  return;
}
{noformat}
So a new upgrade is not ignored..

bq.  I think we can move the logic of ReInitializeContainerTransition to the upgrade API ?
All it does is to resend the events to containerLauncher or ResourceLocalizationService ,
which can be done in the API. Also, this has the benefit of rejecting the upgrade call while
the previous upgrade is in_progress. 
Actually, in one of my initial versions, I was setting the reInitContext on the container
object in the upgrade API itself. But then I realized that all state change on the Container
is affected via a state machine transition and I did not want to break that. I think we should
keep the state change of a Container limited to state machine transitions as the code is cleaner
and we can guard against other synchronization issues.

Also, I think we should have a consistent way to letting the AM know about errors. I prefer
we use the diagnostic messages (with maybe an error code) to let the AM know if failures.
I guess you are already doing it for Failed Localization. This way, the checks and enforcements
are done only in transitions. I agree the AM will not know immediately of an error, but subsequent
'getContainerStatus' calls should notify the AM.

bq. One other potential race is that the relocalize API has a chance to go through instead
of rejected, while upgrading,
I understand.. I was actually thinking we have a set of *reInitilizingContainerId* in the
NM context. we can check if containerId is present in the set before performing upgrade and
relocalize and we can remove from this set once the Container moves from LOCALIZED to RUNNING
(this signifies that reinitialization is complete)

bq. Given so many if(reinitializing) conditions in containerImpl, should we consider adding
a new state?
I tried that at first... but I preferred having more if thens than extra Container states.
I can provide a version with different container states though... 

bq. when launching the container, we need to cleanupPreviousContainerFiles as done in ContainerRelaunch,
right?
This is actually something I wanted to discuss. Do you think we should cleanup the old script
file  ? If the upgrade uses the same script file name, it will be overwritten right ? the
token file is anyway overwritten right ?


> Core changes in NodeManager to support for upgrade and rollback of Containers
> -----------------------------------------------------------------------------
>
>                 Key: YARN-5620
>                 URL: https://issues.apache.org/jira/browse/YARN-5620
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Arun Suresh
>            Assignee: Arun Suresh
>         Attachments: YARN-5620.001.patch, YARN-5620.002.patch, YARN-5620.003.patch, YARN-5620.004.patch,
YARN-5620.005.patch, YARN-5620.006.patch, YARN-5620.007.patch
>
>
> JIRA proposes to modify the ContainerManager (and other core classes) to support upgrade
of a running container with a new {{ContainerLaunchContext}} as well as the ability to rollback
the upgrade if the container is not able to restart using the new launch Context. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org


Mime
View raw message