Date: Wed, 7 Sep 2016 21:18:22 +0000 (UTC)
From: "Arun Suresh (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Comment Edited] (YARN-5620) Core changes in NodeManager to support for upgrade and rollback of Containers

    [ https://issues.apache.org/jira/browse/YARN-5620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15471809#comment-15471809 ]

Arun Suresh edited comment on YARN-5620 at 9/7/16 9:17 PM:
-----------------------------------------------------------

Thanks for the review [~jianhe]

bq. The COMMIT_UPGRADE API: I don’t quite get the necessity of this API. Could you explain under what scenario should the user call this API ?

Consider an AM that upgrades a container with a new binary, after which the process is restarted. Now, after say around 10 minutes, the process dies. There is no way for the NM to know whether the process died because of the upgrade (a memory leak?) or because of some transient failure, and therefore it cannot decide whether to retry the process or roll back the upgrade. Only the AM knows if the upgrade is actually successful. Essentially, the AM should use the commit API, AFTER it has performed some upgrade diagnostics on the container, to notify the NM that the upgrade is fine and that any subsequent failure can be handled by the existing Retry Policy. We can provide an *autoCommit* convenience method that combines upgrade + commit. But I feel it is important we keep the explicit commit API.

bq. The ROLLBACK_UPGRADE API: I think it should be able to rollback to any previous version, rather than only the immediate previous one. In some sense, it’s the same as upgrade.

I agree the AM should be able to move to any previous version, but,
# I feel the versioning should NOT be managed by the NM, since a) the launch context is provided and managed by the AM, so the AM should take care of tying the context to the version, and b) there are (possibly huge) storage implications the NM would have to deal with to keep track of all the earlier versions.
# It should not be called *rollback*. The AM should call {{upgradeContainer(launchContext)}} with some previous context.

bq. IMHO, we probably can use one API restartContainer(context) for both upgrade and downgrade

I agree that both *rollback* (explicit rollback via API) and *upgrade* can be implemented as wrappers over {{restartContainer(launchContext)}}.
But, in my opinion, *rollback* should not be provided with an _explicit_ launchContext; it should always use the immediately previous context.

bq. Also, Forcing containers to be restarted with previous version if upgrade fails may not be suitable in all cases, User wants to troubleshoot the failure first before triggering a new wave of restarts.

Agreed... I can include an UpgradePolicy which allows users to *terminate* or *rollBack* (implicit rollback) on failure. COMMIT is also useful here if the user wants to verify that one wave has upgraded successfully, commit the upgrade in those instances, and then move on to the next wave.

bq. IMO, as first cut implementation, we can fail the container if upgrade fails. we can add retry, rollback, or release the container as RetryPolicy on failure later. your opinion ?

Yup.. will include a policy, as I mentioned above. I don't think *retry* makes sense though.


was (Author: asuresh):
Thanks for the review [~jianhe]

bq. The COMMIT_UPGRADE API: I don’t quite get the necessity of this API. Could you explain under what scenario should the user call this API ?

Consider an AM that upgrades a container with a new binary, after which the process is restarted. Now, after say around 10 minutes, the process dies. There is no way for the NM to know whether the process died because of the upgrade (a memory leak?) or because of some transient failure, and therefore it cannot decide whether to retry the process or roll back the upgrade. Only the AM knows if the upgrade is actually successful. Essentially, the AM should use the commit API, AFTER it has performed some upgrade diagnostics on the container, to notify the NM that the upgrade is fine and that any subsequent failure can be handled by the existing Retry Policy. We can provide an *autoCommit* convenience method that combines upgrade + commit. But I feel it is important we keep the explicit commit API.

bq. The ROLLBACK_UPGRADE API: I think it should be able to rollback to any previous version, rather than only the immediate previous one. In some sense, it’s the same as upgrade.

I agree the AM should be able to move to any previous version, but,
# I feel the versioning should NOT be managed by the NM, since a) the launch context is provided and managed by the AM, so the AM should take care of tying the context to the version, and b) there are (possibly huge) storage implications the NM would have to deal with to keep track of all the earlier versions.
# It should not be called *rollback*. The AM should call {{restartContainer(launchContext)}} with some previous context.

bq. IMHO, we probably can use one API restartContainer(context) for both upgrade and downgrade

I agree that both *rollback* (explicit rollback via API) and *upgrade* can be implemented as wrappers over {{restartContainer(launchContext)}}. But, in my opinion, *rollback* should not be provided with an _explicit_ launchContext; it should always use the immediately previous context.
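
The upgrade / commit / rollback semantics argued for in the comment can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in the attached patches and not the real NodeManager API: the names {{UpgradeSketch}}, {{SketchContainer}}, {{UpgradePolicy}} and {{FailureAction}} are hypothetical, launch contexts are modeled as plain strings, and the NM-side state is assumed to keep only the immediately previous context until the AM commits.

```java
/**
 * Hypothetical sketch only: none of these types exist in YARN, and launch
 * contexts are modeled as plain strings for brevity.
 */
final class UpgradeSketch {

    /** Assumed policy for a failure that happens before the upgrade is committed. */
    enum UpgradePolicy { TERMINATE, ROLLBACK }

    /** What the sketch decides to do when the container process dies. */
    enum FailureAction { RETRY_WITH_EXISTING_POLICY, ROLLED_BACK, TERMINATED }

    /** Toy stand-in for the NM-side bookkeeping of a single container. */
    static final class SketchContainer {
        private final UpgradePolicy policy;
        private String currentContext;
        // Only the immediately previous context is kept; null means
        // "no uncommitted upgrade is pending".
        private String previousContext;

        SketchContainer(String launchContext, UpgradePolicy policy) {
            this.currentContext = launchContext;
            this.policy = policy;
        }

        /** Restart with a new context, remembering the old one until commit. */
        void upgrade(String newContext) {
            previousContext = currentContext;
            currentContext = newContext;
        }

        /** The AM has verified the upgrade: drop the saved context. */
        void commit() {
            previousContext = null;
        }

        /** Restore the immediately previous context; false if nothing is pending. */
        boolean rollback() {
            if (previousContext == null) {
                return false;
            }
            currentContext = previousContext;
            previousContext = null;
            return true;
        }

        /** Decide how to react when the container process exits with a failure. */
        FailureAction onProcessFailure() {
            if (previousContext == null) {
                // Committed (or never upgraded): the normal retry policy applies.
                return FailureAction.RETRY_WITH_EXISTING_POLICY;
            }
            if (policy == UpgradePolicy.ROLLBACK) {
                rollback();
                return FailureAction.ROLLED_BACK;
            }
            return FailureAction.TERMINATED;
        }

        String currentContext() {
            return currentContext;
        }
    }

    public static void main(String[] args) {
        SketchContainer c = new SketchContainer("v1", UpgradePolicy.ROLLBACK);

        // Failure before commit: the policy decides, here an implicit rollback.
        c.upgrade("v2");
        System.out.println(c.onProcessFailure() + " -> " + c.currentContext());

        // Failure after commit: left to the existing retry policy.
        c.upgrade("v2");
        c.commit();
        System.out.println(c.onProcessFailure() + " -> " + c.currentContext());
    }
}
```

In this sketch the explicit commit is what lets a later failure be told apart from an upgrade failure: before {{commit()}} the policy picks between terminating and rolling back, and afterwards the failure is handed to the ordinary retry path. An *autoCommit* variant would simply call {{upgrade}} followed by {{commit}}.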
> Core changes in NodeManager to support for upgrade and rollback of Containers
> -----------------------------------------------------------------------------
>
>                 Key: YARN-5620
>                 URL: https://issues.apache.org/jira/browse/YARN-5620
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Arun Suresh
>            Assignee: Arun Suresh
>         Attachments: YARN-5620.001.patch, YARN-5620.002.patch, YARN-5620.003.patch
>
>
> JIRA proposes to modify the ContainerManager (and other core classes) to support upgrade of a running container with a new {{ContainerLaunchContext}} as well as the ability to rollback the upgrade if the container is not able to restart using the new launch Context.