hadoop-yarn-issues mailing list archives

From "Vrushali C (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-6323) Rolling upgrade/config change is broken on timeline v2.
Date Fri, 21 Jul 2017 02:25:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16095682#comment-16095682 ]

Vrushali C commented on YARN-6323:

Ping on this jira. To summarize:

- A new NM fails to recover apps because the timeline flow context is missing for old apps
on the NM. This patch puts in a default flow context so the NM can proceed.
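
As a rough illustration of the fallback summarized above, here is a minimal sketch. It is not the
code in YARN-6323.001.patch: the class, the map standing in for the NM state store, and the
default values (app id as flow name, "1" as version, start time as run id) are all assumptions
made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; all names here are hypothetical, not the actual patch. */
class FlowContextRecoverySketch {

    /** Minimal stand-in for a timeline flow context. */
    static final class FlowContext {
        final String flowName;
        final String flowVersion;
        final long flowRunId;

        FlowContext(String flowName, String flowVersion, long flowRunId) {
            this.flowName = flowName;
            this.flowVersion = flowVersion;
            this.flowRunId = flowRunId;
        }
    }

    /** Assumed defaults: app id as flow name, "1" as version, start time as run id. */
    static FlowContext defaultFor(String appId, long appStartTime) {
        return new FlowContext(appId, "1", appStartTime);
    }

    /** Returns the stored context, or a default when none was saved for this app. */
    static FlowContext recover(Map<String, FlowContext> stateStore,
                               String appId, long appStartTime) {
        FlowContext stored = stateStore.get(appId);
        // Old apps started before timeline v2 was enabled have no stored context;
        // fall back to a default instead of failing the whole recovery.
        return stored != null ? stored : defaultFor(appId, appStartTime);
    }

    public static void main(String[] args) {
        Map<String, FlowContext> stateStore = new HashMap<>();
        stateStore.put("application_1_0002", new FlowContext("nightly-etl", "3", 42L));

        FlowContext oldApp = recover(stateStore, "application_1_0001", 1000L);
        FlowContext newApp = recover(stateStore, "application_1_0002", 2000L);
        System.out.println("old app flow: " + oldApp.flowName);
        System.out.println("new app flow: " + newApp.flowName);
    }
}
```

The point of the sketch is only that a missing context becomes a default rather than a startup
failure; which default values are sensible is exactly the open question in this thread.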

To answer Rohith's questions:

bq. Application is NOT submitted with tags. So default values are created by YARN.
RM creates default FlowContext with FlowName as appName. On NM restart, we are creating FlowContext
with appId. So, there will be inconsistencies when entities are published during rolling
Yes, there would be inconsistencies, but it is not possible to upgrade the RM and all the
NMs at exactly the same time unless we take downtime.

bq. Assume that Application is submitted with some tags. RM recovers the application and starts
publishing with tags as flow context. Again there are inconsistencies in the published entities.
Yes, but how would we synchronize the RM and NM across restarts? We could use the app id in both
cases, but that makes for strange default data.
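
To make the first inconsistency (no tags) concrete, here is a small sketch. The two methods are
hypothetical stand-ins for the RM-side and NM-side defaults described in the quoted question, and
the app id and app name are made up; this is not YARN API code.

```java
/** Illustrative sketch only; method names are hypothetical, not YARN APIs. */
class FlowNameMismatchSketch {

    /** RM-side default as described in the question: fall back to the app name. */
    static String rmDefaultFlowName(String appName) {
        return appName;
    }

    /** NM-side default on restart as described in the question: fall back to the app id. */
    static String nmRecoveredFlowName(String appId) {
        return appId;
    }

    public static void main(String[] args) {
        String appId = "application_1500000000000_0001"; // made-up id
        String appName = "wordcount";                    // made-up name

        String fromRm = rmDefaultFlowName(appName);
        String fromNm = nmRecoveredFlowName(appId);

        // The same application ends up published under two different flow names.
        System.out.println("RM publishes under flow: " + fromRm);
        System.out.println("NM publishes under flow: " + fromNm);
        System.out.println("consistent: " + fromRm.equals(fromNm));
    }
}
```

Using the app id on both sides would make the two names agree, at the cost of the odd-looking
default data mentioned above.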

This patch will ensure the NM does not fail to start up. I thought of adding some default
values that would signal the data should be dropped, but that would require an expensive check
each time we want to write to the backend.

ping [~rohithsharma] [~varun_saxena] [~haibo.chen]  any other ideas? At the very least, the
NM can't be crashing during an upgrade due to missing flow context. 

> Rolling upgrade/config change is broken on timeline v2. 
> --------------------------------------------------------
>                 Key: YARN-6323
>                 URL: https://issues.apache.org/jira/browse/YARN-6323
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Li Lu
>            Assignee: Vrushali C
>              Labels: yarn-5355-merge-blocker
>         Attachments: YARN-6323.001.patch
> Found this issue when deploying on real clusters. If there are apps running when we enable
timeline v2 (with work preserving restart enabled), node managers will fail to start due to
missing app context data. We should probably assign some default names to these "left over"
apps. I believe it's suboptimal to let users clean up the whole cluster before enabling timeline

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org
