Mailing-List: contact yarn-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Date: Wed, 7 Sep 2016 21:18:22 +0000 (UTC)
From: "Arun Suresh (JIRA)" <jira@apache.org>
To: yarn-issues@hadoop.apache.org
Message-ID: <JIRA.13003072.1473202387000.512595.1473283102248@Atlassian.JIRA>
In-Reply-To: <JIRA.13003072.1473202387000@Atlassian.JIRA>
References: <JIRA.13003072.1473202387000@Atlassian.JIRA> <JIRA.13003072.1473202387957@arcas>
Subject: [jira] [Comment Edited] (YARN-5620) Core changes in NodeManager to
 support for upgrade and rollback of Containers
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
archived-at: Wed, 07 Sep 2016 21:18:24 -0000


    [ https://issues.apache.org/jira/browse/YARN-5620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15471809#comment-15471809 ]

Arun Suresh edited comment on YARN-5620 at 9/7/16 9:17 PM:
-----------------------------------------------------------

Thanks for the review [~jianhe]

bq. The COMMIT_UPGRADE API: I don’t quite get the necessity of this API. Could you explain under what scenario should the user call this API?
Consider an AM that upgrades a container with a new binary, after which the process is restarted. Now, after say around 10 mins, the process dies. There is no way for the NM to know whether the process died because of the upgrade (memory leak?) or due to some transient failure, and therefore it cannot decide whether to Retry the process or Rollback the upgrade. Only the AM knows if the upgrade is actually successful. Essentially, the AM should call the commit API, AFTER it has performed some upgrade diagnostics on the container, to notify the NM that the upgrade is fine and that any subsequent failure can be handled by the existing Retry Policy. We can provide an *autoCommit* convenience method that combines upgrade + commit. But I feel it is important we keep the explicit commit API.
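
To make the commit semantics concrete, here is a minimal sketch of the decision the NM could make when the process dies. All names ({{UpgradeTracker}}, {{FailureAction}}, {{onProcessExit}}) are hypothetical and are not taken from the patch; rolling back on an uncommitted failure is also an assumption, since the action could equally be driven by a policy.

```java
// Hypothetical sketch: none of these types exist in YARN; names are illustrative only.
enum FailureAction { RETRY, ROLLBACK }

final class UpgradeTracker {
    // True between an upgrade and the AM's explicit commit.
    private boolean upgradeInProgress = false;

    // Called when the AM upgrades the container with a new launch context.
    void upgrade() { upgradeInProgress = true; }

    // Called once the AM has run its diagnostics and is satisfied with the upgrade.
    void commit() { upgradeInProgress = false; }

    // Decide what to do when the container process dies.
    FailureAction onProcessExit() {
        // Assumption: before commit, the failure is attributed to the upgrade.
        // After commit, the existing retry policy takes over.
        return upgradeInProgress ? FailureAction.ROLLBACK : FailureAction.RETRY;
    }
}
```

An *autoCommit* variant would amount to calling {{upgrade()}} followed immediately by {{commit()}}.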

bq. The ROLLBACK_UPGRADE API: I think it should be able to rollback to any previous version, rather than only the immediate previous one. In some sense, it’s the same as upgrade.
I agree the AM should be able to move to any previous version, but:
# I feel the versioning should NOT be managed by the NM, since a) the launch context is provided and managed by the AM, so the AM should take care of tying the context to the version, and b) there are (possibly huge) storage implications the NM would have to deal with to keep track of all the earlier versions.
# It should not be called *rollback*. The AM should call {{upgradeContainer(launchContext)}} with some previous context.
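
As an illustration of AM-side versioning, the following sketch keeps a version-to-context map inside the AM, so the NM never sees version numbers. {{VersionRegistry}} and the simplified {{LaunchContext}} are hypothetical stand-ins and are not part of YARN or of the patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a launch context; the real ContainerLaunchContext is a YARN record.
final class LaunchContext {
    final String command;
    LaunchContext(String command) { this.command = command; }
}

// Hypothetical AM-side registry that ties each version label to its launch context.
final class VersionRegistry {
    private final Map<String, LaunchContext> byVersion = new HashMap<>();

    // Remember the context used for a given version.
    void register(String version, LaunchContext ctx) { byVersion.put(version, ctx); }

    // Look up the context the AM would pass to upgradeContainer(launchContext).
    LaunchContext contextFor(String version) {
        LaunchContext ctx = byVersion.get(version);
        if (ctx == null) {
            throw new IllegalArgumentException("Unknown version: " + version);
        }
        return ctx;
    }
}
```

Moving to any earlier version is then just a lookup in the AM followed by a normal upgrade call with the returned context.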


bq. IMHO, we probably can use one API restartContainer(context) for both upgrade and downgrade
I agree that both *rollback* (explicit rollback via API) and *upgrade* can be implemented as wrappers over {{restartContainer(launchContext)}}. But in my opinion *rollback* should not be provided with an _explicit_ launchContext; it should always use the immediately previous context.
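
A rough sketch of that layering, assuming a simplified {{LaunchContext}} and a hypothetical {{ContainerSketch}} class (neither is the NodeManager's actual code): upgrade takes an explicit context, while rollback takes none and restores the one it replaced.

```java
// Hypothetical sketch; not the NodeManager's actual classes.
final class LaunchContext {
    final String command;
    LaunchContext(String command) { this.command = command; }
}

final class ContainerSketch {
    private LaunchContext current;
    private LaunchContext previous; // only the immediately preceding context is kept

    ContainerSketch(LaunchContext initial) { this.current = initial; }

    // Common primitive: relaunch the process with the given context.
    private void restartContainer(LaunchContext ctx) {
        current = ctx; // a real NM would stop and relaunch the process here
    }

    // Upgrade takes an explicit context and remembers the one it replaces.
    void upgradeContainer(LaunchContext next) {
        previous = current;
        restartContainer(next);
    }

    // Rollback takes no context: it always restores the just-previous one.
    void rollback() {
        if (previous == null) {
            throw new IllegalStateException("Nothing to roll back to");
        }
        LaunchContext target = previous;
        previous = null;
        restartContainer(target);
    }

    LaunchContext currentContext() { return current; }
}
```

Keeping a single {{previous}} slot is what bounds the storage cost on the NM side.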

bq. Also, forcing containers to be restarted with previous version if upgrade fails may not be suitable in all cases. User may want to troubleshoot the failure first before triggering a new wave of restarts.
Agreed. I can include an UpgradePolicy which allows users to *terminate* or *rollBack* (implicit rollback) on failure. COMMIT is also useful here if the user wants to verify that one wave has upgraded successfully, commit the upgrade on those instances, and then move on to the next wave.
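
The wave-by-wave use of COMMIT could look roughly like the sketch below, written from the AM's point of view. {{WaveUpgrader}} is a hypothetical helper, container IDs are plain strings, and the health check is a stand-in for whatever diagnostics the AM runs; none of this is taken from the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical AM-side helper; container IDs are plain strings for illustration.
final class WaveUpgrader {
    // Upgrades wave by wave. A wave is committed only if every container in it
    // passes the AM's own health check; otherwise the rollout stops so the
    // failure can be investigated before any further restarts.
    static List<String> upgradeInWaves(List<List<String>> waves,
                                       Predicate<String> upgradedAndHealthy) {
        List<String> committed = new ArrayList<>();
        for (List<String> wave : waves) {
            // Stand-in for: upgrade each container, then run diagnostics on it.
            boolean waveOk = wave.stream().allMatch(upgradedAndHealthy);
            if (!waveOk) {
                break; // uncommitted containers are left to the configured policy
            }
            committed.addAll(wave); // stand-in for sending COMMIT for each container
        }
        return committed;
    }
}
```

What happens to the containers in the failed wave (terminate or implicit rollback) would be decided by the UpgradePolicy, which this sketch deliberately leaves out.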

bq. IMO, as first cut implementation, we can fail the container if upgrade fails. We can add retry, rollback, or release the container as RetryPolicy on failure later. Your opinion?
Yup, will include a policy, as I mentioned above. I don't think *retry* makes sense though.


was (Author: asuresh):
Thanks for the review [~jianhe]

bq. The COMMIT_UPGRADE API: I don’t quite get the necessity of this API. Could you explain under what scenario should the user call this API?
Consider an AM that upgrades a container with a new binary, after which the process is restarted. Now, after say around 10 mins, the process dies. There is no way for the NM to know whether the process died because of the upgrade (memory leak?) or due to some transient failure, and therefore it cannot decide whether to Retry the process or Rollback the upgrade. Only the AM knows if the upgrade is actually successful. Essentially, the AM should call the commit API, AFTER it has performed some upgrade diagnostics on the container, to notify the NM that the upgrade is fine and that any subsequent failure can be handled by the existing Retry Policy. We can provide an *autoCommit* convenience method that combines upgrade + commit. But I feel it is important we keep the explicit commit API.

bq. The ROLLBACK_UPGRADE API: I think it should be able to rollback to any previous version, rather than only the immediate previous one. In some sense, it’s the same as upgrade.
I agree the AM should be able to move to any previous version, but:
# I feel the versioning should NOT be managed by the NM, since a) the launch context is provided and managed by the AM, so the AM should take care of tying the context to the version, and b) there are (possibly huge) storage implications the NM would have to deal with to keep track of all the earlier versions.
# It should not be called *rollback*. The AM should call {{restartContainer(launchContext)}} with some previous context.


bq. IMHO, we probably can use one API restartContainer(context) for both upgrade and downgrade
I agree that both *rollback* (explicit rollback via API) and *upgrade* can be implemented as wrappers over {{restartContainer(launchContext)}}. But in my opinion *rollback* should not be provided with an _explicit_ launchContext; it should always use the immediately previous context.


> Core changes in NodeManager to support for upgrade and rollback of Containers
> -----------------------------------------------------------------------------
>
>                 Key: YARN-5620
>                 URL: https://issues.apache.org/jira/browse/YARN-5620
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Arun Suresh
>            Assignee: Arun Suresh
>         Attachments: YARN-5620.001.patch, YARN-5620.002.patch, YARN-5620.003.patch
>
>
> JIRA proposes to modify the ContainerManager (and other core classes) to support upgrade of a running container with a new {{ContainerLaunchContext}}, as well as the ability to roll back the upgrade if the container is not able to restart using the new launch context.


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
