hadoop-yarn-issues mailing list archives

From "Arun Suresh (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (YARN-5620) Core changes in NodeManager to support for upgrade and rollback of Containers
Date Wed, 07 Sep 2016 21:08:20 GMT

    [ https://issues.apache.org/jira/browse/YARN-5620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15471809#comment-15471809 ]

Arun Suresh edited comment on YARN-5620 at 9/7/16 9:07 PM:
-----------------------------------------------------------

Thanks for the review [~jianhe]

bq. The COMMIT_UPGRADE API: I don’t quite get the necessity of this API. Could you explain
under what scenario should the user call this API ?
Consider an AM that upgrades a container with a new binary, after which the process is
restarted. Now suppose the process dies about 10 minutes later. There is no way for the NM to know
whether the process died because of the upgrade (a memory leak?) or because of some transient failure,
and therefore it cannot decide whether to Retry the process or Rollback the upgrade. Only
the AM knows whether the upgrade was actually successful. Essentially, the AM should use the commit
API, AFTER it has performed some upgrade diagnostics on the container, to notify the NM that the
upgrade is fine and that any subsequent failure can be handled by the existing Retry Policy.
We can provide an *autoCommit* convenience method that combines upgrade + commit, but I feel
it is important that we keep the explicit commit API.
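
To make the commit semantics above concrete, here is a minimal sketch of the bookkeeping the NM might keep per container. It is not taken from the attached patches: the class {{UpgradeTracker}}, its methods, and the use of plain strings as launch contexts are all hypothetical, and the sketch assumes that a failure before commit triggers a rollback.

```java
/**
 * Hypothetical NM-side bookkeeping for a single container.
 * Launch contexts are plain strings here; none of these names come from the patch.
 */
final class UpgradeTracker {
    enum FailureAction { RETRY, ROLLBACK }

    private String currentContext;
    // Non-null only while an upgrade is waiting for the AM's commit.
    private String previousContext;

    UpgradeTracker(String initialContext) {
        this.currentContext = initialContext;
    }

    /** Switches to newContext; with autoCommit the old context is dropped at once. */
    void upgrade(String newContext, boolean autoCommit) {
        previousContext = autoCommit ? null : currentContext;
        currentContext = newContext;
    }

    /** Called by the AM once its own diagnostics say the upgrade is healthy. */
    void commit() {
        previousContext = null;
    }

    /** Decides what to do when the container process exits unexpectedly. */
    FailureAction onProcessExit() {
        if (previousContext == null) {
            // No pending upgrade: the existing retry policy applies.
            return FailureAction.RETRY;
        }
        // Assumption: an uncommitted upgrade is reverted on failure.
        currentContext = previousContext;
        previousContext = null;
        return FailureAction.ROLLBACK;
    }

    String currentContext() {
        return currentContext;
    }
}
```

In this sketch, a process exit between {{upgrade("v2", false)}} and {{commit()}} yields ROLLBACK and restores the old context, while the same exit after {{commit()}} (or after {{upgrade("v2", true)}}) yields RETRY.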

bq. The ROLLBACK_UPGRADE API: I think it should be able to rollback to any previous version,
rather than only the immediate previous one. In some sense, it’s the same as upgrade.
I agree that the AM should be able to move to any previous version, but:
# I feel the versioning should NOT be managed by the NM, since a) the launch context is provided
and managed by the AM, so the AM should take care of tying each context to its version, and b) there
are (possibly huge) storage implications if the NM has to keep track of all
the earlier versions.
# Moving to an arbitrary earlier version should not be called *rollback*. The AM should call
{{restartContainer(launchContext)}} with some previous context.
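
As a rough illustration of the two points above, the AM could keep its own mapping from version to launch context and hand the chosen context to the NM. Everything in this sketch is hypothetical: {{ContainerRestarter}} only stands in for the {{restartContainer(launchContext)}} call discussed here, {{AmVersionRegistry}} is an invented name, and launch contexts are reduced to strings.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the NM call discussed in this thread. */
interface ContainerRestarter {
    void restartContainer(String launchContext);
}

/** AM-side registry tying each version label to its launch context. */
final class AmVersionRegistry {
    private final Map<String, String> contextByVersion = new HashMap<>();
    private final ContainerRestarter nodeManager;

    AmVersionRegistry(ContainerRestarter nodeManager) {
        this.nodeManager = nodeManager;
    }

    /** Records the launch context the AM built for a given version. */
    void register(String version, String launchContext) {
        contextByVersion.put(version, launchContext);
    }

    /** Moves the container to any known version, newer or older. */
    void moveTo(String version) {
        String launchContext = contextByVersion.get(version);
        if (launchContext == null) {
            throw new IllegalArgumentException("unknown version: " + version);
        }
        // The NM only ever sees a context; it never learns about versions.
        nodeManager.restartContainer(launchContext);
    }
}
```

With this split, the NM stores nothing about history, and moving from 2.0 back to 1.0 is the same call as moving forward.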


bq. IMHO, we probably can use one API restartContainer(context) for both upgrade and downgrade
I agree that both *rollback* (explicit rollback via API) and *upgrade* can be implemented
as wrappers over {{restartContainer(launchContext)}}. But, in my opinion, *rollback* should
not take an _explicit_ launchContext; it should always use the immediately previous context.
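
A minimal sketch of that layering, assuming a single retained previous context. The class {{RestartableContainer}} and its method shapes are hypothetical and only illustrate the idea that *rollback* takes no argument; they are not the API in the patch.

```java
/** Hypothetical container handle; launch contexts are plain strings for brevity. */
final class RestartableContainer {
    private String activeContext;
    // Null until the first upgrade; holds only the immediately previous context.
    private String previousContext;

    RestartableContainer(String initialContext) {
        this.activeContext = initialContext;
    }

    /** The single primitive: relaunch the process with the given context. */
    void restartContainer(String launchContext) {
        activeContext = launchContext;
    }

    /** upgrade = remember the active context, then restart with the new one. */
    void upgrade(String newContext) {
        previousContext = activeContext;
        restartContainer(newContext);
    }

    /** rollback takes no context: it restarts with the one saved by upgrade. */
    void rollback() {
        if (previousContext == null) {
            throw new IllegalStateException("nothing to roll back to");
        }
        String target = previousContext;
        previousContext = null;
        restartContainer(target);
    }

    String activeContext() {
        return activeContext;
    }
}
```

An AM that wants an older version than the immediately previous one would call {{restartContainer}} directly with a context it has kept itself.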








> Core changes in NodeManager to support for upgrade and rollback of Containers
> -----------------------------------------------------------------------------
>
>                 Key: YARN-5620
>                 URL: https://issues.apache.org/jira/browse/YARN-5620
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Arun Suresh
>            Assignee: Arun Suresh
>         Attachments: YARN-5620.001.patch, YARN-5620.002.patch, YARN-5620.003.patch
>
>
> JIRA proposes to modify the ContainerManager (and other core classes) to support upgrade
of a running container with a new {{ContainerLaunchContext}} as well as the ability to rollback
the upgrade if the container is not able to restart using the new launch Context. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org

