hadoop-yarn-issues mailing list archives

From "Rohith Sharma K S (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-6555) Enable flow context read (& corresponding write) for recovering application with NM restart
Date Thu, 25 May 2017 05:10:04 GMT

    [ https://issues.apache.org/jira/browse/YARN-6555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16024202#comment-16024202 ]

Rohith Sharma K S commented on YARN-6555:

bq. Rohith, should I rebase my patch on YARN-6323 after this one goes in? Then I can create
a default flow context if the null check at line 386 in ContainerManagerImpl.java fails.
This JIRA and YARN-6323 seem to overlap a bit. Shall we first commit this patch, which only
recovers the FlowContext, and discuss the upgrade scenario later? The upgrade scenario needs
us to think through the different cases.

bq. In buildAppProto at lines 986 onwards in ContainerManagerImpl.java, should those be done
only if ATSv2 is enabled?
It is not necessary to check again whether ATSv2 is enabled: if ATSv2 is not enabled,
flowContext will be null, so the FlowContext proto is not built at all.
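
To illustrate the point about the null check, here is a minimal sketch. It is not the code from the patch: FlowContextProtoSketch, FlowContext, AppRecord and buildAppRecord are hypothetical stand-ins for the NM classes and for buildAppProto. The only assumption taken from this thread is that flowContext is null whenever ATSv2 is disabled, so a null check alone decides whether the flow fields are written.

```java
import java.util.Optional;

public class FlowContextProtoSketch {

    /** Hypothetical stand-in for the NM flow context (not the real class). */
    static final class FlowContext {
        final String flowName;
        final String flowVersion;
        final long flowRunId;

        FlowContext(String flowName, String flowVersion, long flowRunId) {
            this.flowName = flowName;
            this.flowVersion = flowVersion;
            this.flowRunId = flowRunId;
        }
    }

    /** Hypothetical stand-in for the application record persisted for recovery. */
    static final class AppRecord {
        final String appId;
        final Optional<FlowContext> flowContext;

        AppRecord(String appId, Optional<FlowContext> flowContext) {
            this.appId = appId;
            this.flowContext = flowContext;
        }
    }

    /**
     * Builds the record to persist. No separate "is ATSv2 enabled" check is made:
     * a null flowContext (assumed to mean ATSv2 is off) simply yields a record
     * without flow fields.
     */
    static AppRecord buildAppRecord(String appId, FlowContext flowContext) {
        return new AppRecord(appId, Optional.ofNullable(flowContext));
    }

    public static void main(String[] args) {
        AppRecord withoutFlow = buildAppRecord("application_1", null);
        AppRecord withFlow =
            buildAppRecord("application_2", new FlowContext("flow", "1", 1L));
        System.out.println(withoutFlow.flowContext.isPresent());
        System.out.println(withFlow.flowContext.isPresent());
    }
}
```

In this sketch the absence of the flow context is carried through to the stored record, which is why a second configuration check would be redundant.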

bq. At line 1041, in startContainerInternal in ContainerManagerImpl.java, just trying to understand
why these were moved.
Oh, sorry, I did not explain this when attaching the patch. It is just an optimization: there
is no need to create a FlowContext object every time if the application is already added into
> Enable flow context read (& corresponding write) for recovering application with
NM restart 
> --------------------------------------------------------------------------------------------
>                 Key: YARN-6555
>                 URL: https://issues.apache.org/jira/browse/YARN-6555
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>    Affects Versions: YARN-5355, YARN-5355-branch-2, 3.0.0-alpha3
>            Reporter: Vrushali C
>            Assignee: Rohith Sharma K S
>         Attachments: YARN-6555.001.patch, YARN-6555.002.patch
> If timeline service v2 is enabled and the NM is restarted with recovery enabled, then the NM
fails to start and throws the error "flow context cannot be null".
> This happens because the flow context did not exist before, but now that timeline
service v2 is enabled, ApplicationImpl expects it to exist.
> This would also happen even if the flow context existed before: since we are not persisting
it or reading it during ContainerManagerImpl#recoverApplication, it does not get passed in
to ApplicationImpl.
> Full stack trace:
> {code}
> 2017-05-03 21:51:52,178 FATAL org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting NodeManager
> java.lang.IllegalArgumentException: flow context cannot be null
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.<init>(ApplicationImpl.java:104)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.<init>(ApplicationImpl.java:90)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverApplication(ContainerManagerImpl.java:318)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:280)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:267)
>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>         at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:276)
>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:588)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:649)
> {code}

This message was sent by Atlassian JIRA
