From: "Jason Lowe (JIRA)"
To: yarn-issues@hadoop.apache.org
Date: Wed, 11 Feb 2015 21:07:15 +0000 (UTC)
Subject: [jira] [Commented] (YARN-914) Support graceful decommission of nodemanager

[ https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316980#comment-14316980 ]

Jason Lowe commented on YARN-914:
---------------------------------

Thanks for updating the doc, Junping. Additional comments:

Nit: how about DECOMMISSIONING instead of DECOMMISSION_IN_PROGRESS?

The design says that when a node starts decommissioning we will remove its resources from the cluster, but that's not really the case, correct? We should remove its available (not total) resources from the cluster, then continue to remove the newly available resources as containers complete on that node. Failing to do so will result in odd metrics, such as the cluster showing more resources in use than it claims to have (a rough sketch of this accounting follows below).

Are we only going to support graceful decommission via updates to the include/exclude files and a refresh? Not needed for the initial cut, but I'm thinking of a couple of use cases and am curious what others think:
* It would be convenient to have an rmadmin command that does this in one step, especially for a single node. Arguably, if we are persisting cluster nodes in the state store we can migrate the list there, and the include/exclude files simply become convenient ways to batch-update the cluster state.
* Will NMs be able to request a graceful decommission via their health check script? There have been cases in the past where it would have been nice for the NM to request a ramp-down on containers rather than instantly killing all of them with an UNHEALTHY report.

As for the UI changes, my initial thought is that decommissioning nodes should still show up in the active nodes list since they are still running containers. A separate decommissioning tab to filter for those nodes would be nice, although users can also just use the jquery table to sort/search for nodes in that state from the active nodes list if it's too crowded to add yet another node state tab (or maybe we get rid of some effectively dead tabs, like the reboot state tab).
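To make the resource accounting concrete, here is a rough sketch of the bookkeeping I have in mind (the class and method names are purely illustrative, not existing RM code; only the Resource/Resources types are real YARN classes):

{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

// Illustrative sketch only -- not actual ResourceManager code.
public class DecommissioningMetricsSketch {

  /** Node transitions RUNNING -> DECOMMISSIONING. */
  void onStartDecommissioning(Resource nodeTotal, Resource nodeAllocated,
      Resource clusterTotal) {
    // Remove only what is currently free on the node; running containers
    // keep their share counted in the cluster total until they finish.
    Resource available = Resources.subtract(nodeTotal, nodeAllocated);
    Resources.subtractFrom(clusterTotal, available);
  }

  /** A container completes on a DECOMMISSIONING node. */
  void onContainerFinished(Resource containerResource, Resource clusterTotal) {
    // The freed capacity will never be handed out on this node again,
    // so drop it from the cluster total as well.
    Resources.subtractFrom(clusterTotal, containerResource);
  }
}
{code}

That way the cluster total always reflects what can still be scheduled plus what is still running, and the node's contribution converges to zero once it is fully decommissioned.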
For the NM restart open question: this should no longer be an issue now that the NM is unaware of graceful decommission. All the RM needs to do is ensure that a node rejoining the cluster, when the RM thought it was already part of it, retains its previous running/decommissioning state. That way, if an NM was decommissioning before the restart it will continue to decommission after it restarts.

For the AM dealing with being notified of decommissioning, again I think this should just be treated like a strict preemption for the short term. IMHO all the AM needs to know is that the RM is planning on taking away those containers, and what the AM should do about it is similar whether the reason for removal is preemption or decommissioning. (A minimal AM-side sketch of that handling appears at the end of this message.)

Back to the concern about long-running services delaying decommissioning: does YARN even know the difference between a long-running container and a "normal" container? If it doesn't, how is it supposed to know a container is not going to complete anytime soon? Even a "normal" container could run for many hours. It seems to me the first thing we would need before worrying about this scenario is the ability for YARN to know/predict the expected runtime of containers.

There's still an open question about tracking the timeout on the RM side instead of the NM side. It sounds like the NM side is not going to be pursued at this point, and we're going with no built-in timeout support in YARN for the short term.

> Support graceful decommission of nodemanager
> --------------------------------------------
>
> Key: YARN-914
> URL: https://issues.apache.org/jira/browse/YARN-914
> Project: Hadoop YARN
> Issue Type: Improvement
> Affects Versions: 2.0.4-alpha
> Reporter: Luke Lu
> Assignee: Junping Du
> Attachments: Gracefully Decommission of NodeManager (v1).pdf, Gracefully Decommission of NodeManager (v2).pdf
>
> When NMs are decommissioned for non-fault reasons (capacity change, etc.), it's desirable to minimize the impact on running applications.
> Currently, if an NM is decommissioned, all running containers on the NM need to be rescheduled on other NMs. Furthermore, for finished map tasks, if their map output has not been fetched by the reducers of the job, those map tasks will need to be rerun as well.
> We propose to introduce a mechanism to optionally gracefully decommission a node manager.
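To illustrate the strict-preemption suggestion above: the allocate protocol already tells an AM which containers the RM intends to reclaim, so reacting to decommissioning the same way would look roughly like the following (a hypothetical handler sketch, not code from any existing AM or from this patch; the AllocateResponse/PreemptionMessage records are the existing protocol types):

{code:java}
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.PreemptionContainer;
import org.apache.hadoop.yarn.api.records.PreemptionMessage;

// Illustrative sketch only -- not code from any existing AM.
public class PreemptionAwareAm {

  /** Inspect each allocate response and ramp down work on doomed containers. */
  void handleAllocateResponse(AllocateResponse response) {
    PreemptionMessage msg = response.getPreemptionMessage();
    if (msg == null || msg.getStrictContract() == null) {
      return; // nothing is being reclaimed this heartbeat
    }
    for (PreemptionContainer pc : msg.getStrictContract().getContainers()) {
      ContainerId doomed = pc.getId();
      // Whether the cause is preemption or a decommissioning node, the AM's
      // reaction is the same: wrap up or move the work and release the
      // container before the RM takes it away.
      rampDown(doomed);
    }
  }

  void rampDown(ContainerId containerId) {
    // Application-specific: checkpoint state, reschedule the work elsewhere, etc.
  }
}
{code}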