Date: Wed, 24 May 2017 02:05:04 +0000 (UTC)
From: "Rohith Sharma K S (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-6555) Enable flow context read (& corresponding write) for recovering application with NM restart

[ https://issues.apache.org/jira/browse/YARN-6555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16022203#comment-16022203 ]

Rohith Sharma K S commented on YARN-6555:
-----------------------------------------

bq. Do you think we should preserve as much flow context information as possible? The patch stores flow context in the state store only if all three fields of the flow context are present. We could sanitize the flow context, fill in default values for whatever field is missing, and then just check if flowcontext != null before storing application state.

My 2 cents:
# IMO, we should NOT set default values for flow context. There are 2 cases:
## Master container launched: the RM sets the flow context in the container launch context and starts it. This needs to be recovered during NM restart.
## AM launches containers: flow context details are not set, so there is nothing to store or recover during NM restart, and storing them would serve no purpose.
# The additional null check for strings before creating the proto is needed because the string setter methods in the proto throw an NPE if flowName or flowVersion is null.

bq. FlowContext.toString(). Can we do something like {k1=v1, k2=v2, k3=v3} for better readability in the log?

Makes sense. I will change it in the next patch after Vrushali reviews it.

> Enable flow context read (& corresponding write) for recovering application with NM restart
> --------------------------------------------------------------------------------------------
>
> Key: YARN-6555
> URL: https://issues.apache.org/jira/browse/YARN-6555
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Affects Versions: YARN-5355, YARN-5355-branch-2, 3.0.0-alpha3
> Reporter: Vrushali C
> Assignee: Rohith Sharma K S
> Attachments: YARN-6555.001.patch, YARN-6555.002.patch
>
>
> If timeline service v2 is enabled and NM is restarted with recovery enabled, then NM fails to start and throws an error as "flow context can't be null".
> This is happening because the flow context did not exist before, but now that timeline service v2 is enabled, ApplicationImpl expects it to exist.
> This would also happen even if the flow context existed before: since we are not persisting it or reading it during ContainerManagerImpl#recoverApplication, it does not get passed in to ApplicationImpl.
> full stack trace
> {code}
> 2017-05-03 21:51:52,178 FATAL org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting NodeManager
> java.lang.IllegalArgumentException: flow context cannot be null
> 	at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.<init>(ApplicationImpl.java:104)
> 	at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.<init>(ApplicationImpl.java:90)
> 	at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverApplication(ContainerManagerImpl.java:318)
> 	at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:280)
> 	at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:267)
> 	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> 	at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> 	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:276)
> 	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> 	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:588)
> 	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:649)
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org
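
For readers of the archive, the storage rule and log format discussed in the comment above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the YARN-6555 patches: FlowContextSketch, the nested FlowContext class, isComplete() and toPersist() are hypothetical names, no Hadoop or protobuf classes are used, and flowRunId as the third field (boxed so it can be null) is an assumption, since the comment names only flowName and flowVersion.

```java
import java.util.Optional;

public class FlowContextSketch {

    /** Hypothetical stand-in for a flow context; NOT the real Hadoop class. */
    public static final class FlowContext {
        private final String flowName;
        private final String flowVersion;
        private final Long flowRunId; // assumed third field; boxed so "missing" can be null

        public FlowContext(String flowName, String flowVersion, Long flowRunId) {
            this.flowName = flowName;
            this.flowVersion = flowVersion;
            this.flowRunId = flowRunId;
        }

        /** True only when all three fields are present. */
        public boolean isComplete() {
            return flowName != null && flowVersion != null && flowRunId != null;
        }

        /** Log-friendly {k1=v1, k2=v2, k3=v3} form, as suggested in the review. */
        @Override
        public String toString() {
            return "{flowName=" + flowName
                + ", flowVersion=" + flowVersion
                + ", flowRunId=" + flowRunId + "}";
        }
    }

    /**
     * Returns the context to write to the state store, or empty when nothing
     * should be stored. No default values are filled in: an absent or partial
     * context is simply skipped, which also keeps null strings away from any
     * builder whose setters reject null.
     */
    public static Optional<FlowContext> toPersist(FlowContext ctx) {
        if (ctx == null || !ctx.isComplete()) {
            return Optional.empty();
        }
        return Optional.of(ctx);
    }

    public static void main(String[] args) {
        // Context set for the master container: complete, so it is stored.
        FlowContext fromRm = new FlowContext("wordcount", "1", 1495591504000L);
        // Container launched by the AM: no flow context, so nothing is stored.
        FlowContext fromAm = null;

        System.out.println(toPersist(fromRm).map(FlowContext::toString).orElse("<not stored>"));
        System.out.println(toPersist(fromAm).map(FlowContext::toString).orElse("<not stored>"));
    }
}
```

The sketch deliberately returns Optional rather than a context filled with defaults, matching the position in the comment that a missing flow context should not be stored or recovered at all.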