hadoop-yarn-issues mailing list archives

From "Arun Suresh (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (YARN-5620) Core changes in NodeManager to support for upgrade and rollback of Containers
Date Wed, 07 Sep 2016 21:18:22 GMT

    [ https://issues.apache.org/jira/browse/YARN-5620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15471809#comment-15471809 ]

Arun Suresh edited comment on YARN-5620 at 9/7/16 9:17 PM:
-----------------------------------------------------------

Thanks for the review [~jianhe]

bq. The COMMIT_UPGRADE API: I don’t quite get the necessity of this API. Could you explain
under what scenario should the user call this API ?
Consider an AM that upgrades a container with a new binary, after which the process is
restarted. Now, after say around 10 mins, the process dies. There is no way for the NM to know
if the process died because of the upgrade (memory leak?) or due to some transient failure,
and therefore it cannot decide whether to Retry the process or Rollback the upgrade. Only
the AM knows if the upgrade is actually successful. Essentially, the commit API should be
used by the AM, AFTER it has performed some upgrade diagnostics on the container, to notify
the NM that the upgrade is fine and any subsequent failure can be handled by the existing Retry Policy.
We can provide an *autoCommit* convenience method that combines upgrade + commit. But I feel
it is important we keep the explicit commit API.
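
A minimal sketch of the commit semantics described above, under stated assumptions. All names here ({{UpgradeCommitSketch}}, {{ContainerRecord}}, {{UpgradeState}}, {{FailureAction}}) are hypothetical and are not taken from the YARN-5620 patch. It only illustrates how the NM could choose between Retry and Rollback depending on whether the AM has committed the upgrade.

```java
/** Hypothetical sketch; none of these names come from the YARN-5620 patch. */
final class UpgradeCommitSketch {
    enum UpgradeState { NOT_UPGRADED, AWAITING_COMMIT, COMMITTED }
    enum FailureAction { RETRY, ROLLBACK }

    /** Stand-in for the NM's per-container bookkeeping. */
    static final class ContainerRecord {
        private UpgradeState state = UpgradeState.NOT_UPGRADED;

        /** autoCommit == true models the "upgrade + commit" convenience call. */
        void upgrade(boolean autoCommit) {
            state = autoCommit ? UpgradeState.COMMITTED : UpgradeState.AWAITING_COMMIT;
        }

        /** Called by the AM once its own upgrade diagnostics have passed. */
        void commit() {
            if (state != UpgradeState.AWAITING_COMMIT) {
                throw new IllegalStateException("no upgrade awaiting commit");
            }
            state = UpgradeState.COMMITTED;
        }

        /** A failure before commit is blamed on the upgrade; otherwise the retry policy applies. */
        FailureAction onProcessExit() {
            return state == UpgradeState.AWAITING_COMMIT
                    ? FailureAction.ROLLBACK
                    : FailureAction.RETRY;
        }
    }

    public static void main(String[] args) {
        ContainerRecord c = new ContainerRecord();
        c.upgrade(false);
        System.out.println(c.onProcessExit()); // ROLLBACK: upgrade not yet committed
        c.commit();
        System.out.println(c.onProcessExit()); // RETRY: failure after commit
    }
}
```

In this sketch, the explicit commit is what moves a failure from "suspect the upgrade" to "treat as an ordinary failure".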

bq. The ROLLBACK_UPGRADE API: I think it should be able to rollback to any previous version,
rather than only the immediate previous one. In some sense, it’s the same as upgrade.
I agree the AM should be able to move to any previous version, but:
# I feel the versioning should NOT be managed by the NM, since a) the launch context is provided
and managed by the AM, so the AM should take care of tying the context to the version, and b) there
are (possibly huge) storage implications the NM would have to deal with to keep track of all
the earlier versions.
# It should not be called *rollback*. The AM should call {{upgradeContainer(launchContext)}}
with some previous context.
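
The two points above can be sketched as AM-side bookkeeping, under stated assumptions. {{LaunchContext}} is a simplified stand-in for {{ContainerLaunchContext}}, and {{NodeManagerClient}}, {{VersionRegistry}} and {{moveTo}} are hypothetical names for illustration only. The AM keeps the version-to-context mapping and moves to an older version by issuing an ordinary upgrade call with that older context.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; names are illustrative, not part of the YARN API. */
final class AmVersionRegistrySketch {
    /** Simplified stand-in for ContainerLaunchContext. */
    static final class LaunchContext {
        final String command;
        LaunchContext(String command) { this.command = command; }
    }

    /** Stand-in for whatever client the AM uses to reach the NM. */
    interface NodeManagerClient {
        void upgradeContainer(String containerId, LaunchContext context);
    }

    /** The AM, not the NM, ties each version label to its launch context. */
    static final class VersionRegistry {
        private final Map<String, LaunchContext> contextsByVersion = new LinkedHashMap<>();

        void register(String version, LaunchContext context) {
            contextsByVersion.put(version, context);
        }

        /** Moving to any version, older or newer, is just an upgrade call with that context. */
        void moveTo(String version, String containerId, NodeManagerClient nm) {
            LaunchContext context = contextsByVersion.get(version);
            if (context == null) {
                throw new IllegalArgumentException("unknown version: " + version);
            }
            nm.upgradeContainer(containerId, context);
        }
    }
}
```

With this split, the NM never needs to store more than what it is currently running (plus, at most, the immediately previous context).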



bq. IMHO, we probably can use one API restartContainer(context) for both upgrade and downgrade
I agree that both *rollback* (explicit rollback via API) and *upgrade* can be implemented
as wrappers over {{restartContainer(launchContext)}}. But, in my opinion, *rollback* should
not be provided with an _explicit_ launchContext; it should always use the immediately previous context.
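
A rough sketch of that layering, under stated assumptions. The launch context is modelled as a plain {{String}}, and {{RestartWrapperSketch}} and {{ContainerState}} are hypothetical names. {{upgrade}} takes an explicit context, while {{rollback}} takes none and reuses the single previous context that was retained.

```java
/** Hypothetical sketch; the launch context is modelled as a plain String. */
final class RestartWrapperSketch {
    static final class ContainerState {
        private String current;
        private String previous; // only the immediately previous context is retained

        ContainerState(String initialContext) {
            this.current = initialContext;
        }

        /** Common primitive: a real implementation would relaunch the process here. */
        private void restartContainer(String launchContext) {
            current = launchContext;
        }

        /** Upgrade: remember what was running, then restart with the new context. */
        void upgrade(String newContext) {
            previous = current;
            restartContainer(newContext);
        }

        /** Rollback: no explicit context; always the one retained by the last upgrade. */
        void rollback() {
            if (previous == null) {
                throw new IllegalStateException("nothing to roll back to");
            }
            restartContainer(previous);
            previous = null;
        }

        String current() {
            return current;
        }
    }

    public static void main(String[] args) {
        ContainerState c = new ContainerState("ctx-v1");
        c.upgrade("ctx-v2");
        c.rollback();
        System.out.println(c.current()); // ctx-v1
    }
}
```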

bq. Also, Forcing containers to be restarted with previous version if upgrade fails may not
be suitable in all cases, User wants to troubleshoot the failure first before triggering a
new wave of restarts.
Agreed. I can include an UpgradePolicy which allows users to *terminate* or *rollBack* (implicit
rollback) on failure. Also, COMMIT is useful here if the user wants to verify that one wave has
upgraded successfully, commit the upgrade on those instances, and then move on to the next wave.
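
The wave-by-wave use of COMMIT could look roughly like this on the AM side. This is a sketch under stated assumptions: {{ContainerOps}}, {{rollOut}} and the {{healthy}} check are hypothetical names, and the check stands in for whatever upgrade diagnostics the AM runs. A wave is committed only after every container in it passes verification; otherwise the rollout stops so the failure can be investigated before any further restarts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch; names are illustrative, not part of the YARN API. */
final class WaveRolloutSketch {
    /** Stand-in for the upgrade and commit calls the AM would send to the NM. */
    interface ContainerOps {
        void upgrade(String containerId);
        void commit(String containerId);
    }

    /**
     * Upgrades one wave at a time and returns the ids that were committed.
     * Stops at the first wave in which any container fails verification,
     * leaving that wave uncommitted and later waves untouched.
     */
    static List<String> rollOut(List<List<String>> waves,
                                ContainerOps ops,
                                Predicate<String> healthy) {
        List<String> committed = new ArrayList<>();
        for (List<String> wave : waves) {
            for (String id : wave) {
                ops.upgrade(id);
            }
            for (String id : wave) {
                if (!healthy.test(id)) {
                    return committed; // halt: do not start a new wave of restarts
                }
            }
            for (String id : wave) {
                ops.commit(id);
                committed.add(id);
            }
        }
        return committed;
    }
}
```

What happens to the uncommitted wave (terminate or implicit rollback) would then be governed by the policy mentioned above.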

bq. IMO, as first cut implementation, we can fail the container if upgrade fails. we can add
retry,  rollback, or release the container as RetryPolicy on failure later. your opinion ?
Yup.. will include a policy, as I mentioned above. Don't think *retry* makes sense though.





was (Author: asuresh):
Thanks for the review [~jianhe]

bq. The COMMIT_UPGRADE API: I don’t quite get the necessity of this API. Could you explain
under what scenario should the user call this API ?
Consider an AM that upgrades a container with a new binary and the process is subsequently
restarted. Now after say around 10 mins the process dies. There is no way for the NM to know
if the process died because of the upgrade (memory leak?) or due to some transient failure,
and therefore it cannot make the decision to Retry the process or Rollback the upgrade. Only
the AM knows if the upgrade is actually successful. Essentially, the commit API should be
used by the AM to notify the NM that upgrade is fine and any subsequent failure can be handled
by the existing Retry Policy AFTER it has performed some upgrade diagnostics on the container.
We can provide an *autoCommit* convenience method that clubs upgrade + commit. But I feel
it is important we keep the explicit commit API.

bq. The ROLLBACK_UPGRADE API: I think it should be able to rollback to any previous version,
rather than only the immediate previous one. In some sense, it’s the same as upgrade.
I agree AM should be able to move to any previous version, but,
# I feel the versioning should NOT be managed by the NM, since a) the launch context is provided
and managed by the AM, the AM should take care of tying the context with the version b) There
are (possibly huge) storage implications the NM would have to deal with to keep track of all
the earlier versions.
# It should not be called *rollback*. The AM should call {{restartContainer(launchContext)}}
with some previous context. 


bq. IMHO, we probably can use one API restartContainer(context) for both upgrade and downgrade
I agree that both *rollback* (explicit rollback via API) and *upgrade* can be implemented
as wrappers over {{restartContainer(launchContext)}}. But, in my opinion *rollback* should
not be provided with an _explicit_ launchContext, it should always be the just previous context.






> Core changes in NodeManager to support for upgrade and rollback of Containers
> -----------------------------------------------------------------------------
>
>                 Key: YARN-5620
>                 URL: https://issues.apache.org/jira/browse/YARN-5620
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Arun Suresh
>            Assignee: Arun Suresh
>         Attachments: YARN-5620.001.patch, YARN-5620.002.patch, YARN-5620.003.patch
>
>
> JIRA proposes to modify the ContainerManager (and other core classes) to support upgrade
of a running container with a new {{ContainerLaunchContext}} as well as the ability to rollback
the upgrade if the container is not able to restart using the new launch Context. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org

