Date: Tue, 29 Jul 2014 00:15:39 +0000 (UTC)
From: "Jason Lowe (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-1354) Recover applications upon nodemanager restart

[ https://issues.apache.org/jira/browse/YARN-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077178#comment-14077178 ]

Jason Lowe commented on YARN-1354:
----------------------------------

Thanks for taking a look, Junping!

bq. what would happen if storeApplication(), finishApplication(), removeApplication() failed with application related information get inconsistent after restart?

If storeApplication fails then it will throw an IOException which will bubble up and fail the container start request on the client.
As long as we're unable to store a new application, containers for that application will not start, which I believe is the desired behavior. That prevents the state store from being inconsistent in this particular scenario.

If finishApplication fails then the NM will proceed as if it had succeeded, but the state store will still have the application present. This should be corrected when the NM restarts and registers with the RM with those applications still running. The RM should correct the situation by telling the NM that the application has finished (see YARN-1885), and the NM will proceed to perform application finish processing (e.g., log aggregation). I think worst-case it will upload all of the app container logs again, but when it goes to rename to the final destination name that will fail because the name already exists. Thus there could be some wasted work, but it should sort itself out and not do something catastrophic.

If removeApplication fails then the NM will proceed as if it had succeeded, but the state store will still have the application present. This should be corrected when the NM finishes application processing (per above, or if it was already recorded as finished), at which point it will again try to remove the application from the state store. As above, I think there could be some unnecessary work performed, but in the end the application should eventually be removed from the NM on restart. It could still remain in the state store if the second removal also fails, but a subsequent restart should behave the same.

bq. Do we need special warning if get failed on deserializing credential here?

I'm not sure how credential processing is fundamentally all that different from protocol buffer parsing, which could also fail. If the credentials can't be read then we can't recover the application. Currently recovery errors are fatal to NM startup.
Do you have something specific in mind for handling the credentials if the writable changes (e.g., some pseudocode to show the approach)?

> Recover applications upon nodemanager restart
> ---------------------------------------------
>
> Key: YARN-1354
> URL: https://issues.apache.org/jira/browse/YARN-1354
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager
> Affects Versions: 2.3.0
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Attachments: YARN-1354-v1.patch, YARN-1354-v2-and-YARN-1987-and-YARN-1362.patch, YARN-1354-v3.patch, YARN-1354-v4.patch, YARN-1354-v5.patch
>
>
> The set of active applications in the nodemanager context need to be recovered for work-preserving nodemanager restart
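
As a rough illustration of the failure handling described in the comment above, here is a minimal sketch of the three cases. It is not the code in the attached patches: `RecoverySketch`, `FlakyStore`, `NodeManagerSketch`, the `failOn` switch, and the `restart` helper are all hypothetical names made up for this example, and the RM's "application finished" notification after re-registration is simulated by calling `finishApplication` directly.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch: none of these classes are the real NodeManager classes.
class RecoverySketch {

  /** In-memory stand-in for the NM state store; one named call can be made to fail. */
  static class FlakyStore {
    // appId -> true once the application has been recorded as finished
    final Map<String, Boolean> apps = new TreeMap<>();
    String failOn = null; // name of the operation whose next call should throw

    private void maybeFail(String op) throws IOException {
      if (op.equals(failOn)) {
        failOn = null;
        throw new IOException(op + " failed");
      }
    }

    void storeApplication(String appId) throws IOException {
      maybeFail("storeApplication");
      apps.put(appId, false);
    }

    void finishApplication(String appId) throws IOException {
      maybeFail("finishApplication");
      apps.replace(appId, true);
    }

    void removeApplication(String appId) throws IOException {
      maybeFail("removeApplication");
      apps.remove(appId);
    }
  }

  /** Simplified NM that tracks active applications and mirrors them in the store. */
  static class NodeManagerSketch {
    final FlakyStore store;
    final Set<String> active = new HashSet<>();

    NodeManagerSketch(FlakyStore store) {
      this.store = store;
    }

    /** A store failure propagates, so the container start request fails. */
    void startContainer(String appId) throws IOException {
      if (!active.contains(appId)) {
        store.storeApplication(appId); // may throw; the app is then never tracked
        active.add(appId);
      }
    }

    /** Called when the RM reports the application as finished. */
    void finishApplication(String appId) {
      try {
        store.finishApplication(appId);
      } catch (IOException e) {
        // Proceed as if it succeeded; the store still shows the app as running.
      }
      // Application finish processing (e.g. log aggregation) would happen here.
      try {
        store.removeApplication(appId);
      } catch (IOException e) {
        // Proceed anyway; the stale entry is retried after the next restart.
      }
      active.remove(appId);
    }

    /** Restart: every application still in the store is recovered as active. */
    static NodeManagerSketch restart(FlakyStore store) {
      NodeManagerSketch nm = new NodeManagerSketch(store);
      nm.active.addAll(store.apps.keySet());
      return nm;
    }
  }

  public static void main(String[] args) throws IOException {
    FlakyStore store = new FlakyStore();
    NodeManagerSketch nm = new NodeManagerSketch(store);
    nm.startContainer("app_1");

    store.failOn = "removeApplication"; // removal fails, the NM carries on
    nm.finishApplication("app_1");
    System.out.println("in store after failed removal: " + store.apps.keySet());

    // After a restart the stale app is recovered, finished again, and removed.
    NodeManagerSketch restarted = NodeManagerSketch.restart(store);
    for (String appId : new HashSet<>(restarted.active)) {
      restarted.finishApplication(appId);
    }
    System.out.println("in store after restart: " + store.apps.keySet());
  }
}
```

The asymmetry is deliberate in this sketch: a failed store is the only case that is surfaced to the caller, because it is the only one where proceeding would leave a running application that the store knows nothing about. Failed finish and remove calls leave extra state behind, which a later restart can reconcile at the cost of some repeated work.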