hadoop-yarn-issues mailing list archives

From "Sandy Ryza (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-24) Nodemanager fails to start if log aggregation enabled and namenode unavailable
Date Fri, 22 Feb 2013 21:28:13 GMT

    [ https://issues.apache.org/jira/browse/YARN-24?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13584711#comment-13584711 ]

Sandy Ryza commented on YARN-24:
--------------------------------

I encountered this when trying to start an NM and a namenode at the same time.  The NM shut
down because the namenode was in safe mode.  Having the NM die in this way introduces a dependency
on the order in which services are started.

Log aggregation is checked each time an app is run on a node, and the app is immediately killed
if a log folder cannot be used for it.  Thus, simply removing the NM's self-shutdown on startup
doesn't introduce any correctness issues.  The worst that could happen is that time is
wasted scheduling more containers on a node that is already known to have connection issues to
the namenode.

Attached a patch that removes the NM killing itself on startup.  At initApp time, if verifyAndCreateRemoteLogDir
has not yet completed successfully, it is called again, and the app is failed if that call fails.
If initApp fails five consecutive times, the NM sets its status to unhealthy.
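
To make the intended behavior concrete, here is a rough sketch of the retry logic described above. It is not the code in the attached patch: the class LazyLogDirCheck, the RemoteLogDirVerifier interface, the boolean return value of initApp, and the choice to mark the node healthy again after a later success are all assumptions made for illustration. Only the names initApp and verifyAndCreateRemoteLogDir and the threshold of five consecutive failures come from the description.

```java
import java.io.IOException;

/** Hypothetical stand-in for the remote log directory check (not a real NM interface). */
interface RemoteLogDirVerifier {
    void verifyAndCreateRemoteLogDir() throws IOException;
}

/** Illustrative only: defers the remote log dir check from NM startup to initApp time. */
class LazyLogDirCheck {
    // Threshold taken from the comment: five consecutive initApp failures.
    static final int MAX_CONSECUTIVE_FAILURES = 5;

    private final RemoteLogDirVerifier verifier;
    private boolean remoteDirVerified = false;
    private int consecutiveFailures = 0;
    private boolean healthy = true;

    LazyLogDirCheck(RemoteLogDirVerifier verifier) {
        this.verifier = verifier;
    }

    /**
     * Returns true if the app may proceed, false if it should be failed.
     * The check is retried only while it has never succeeded.
     */
    synchronized boolean initApp(String appId) {
        if (!remoteDirVerified) {
            try {
                verifier.verifyAndCreateRemoteLogDir();
                remoteDirVerified = true;
            } catch (IOException e) {
                consecutiveFailures++;
                if (consecutiveFailures >= MAX_CONSECUTIVE_FAILURES) {
                    healthy = false; // NM reports itself unhealthy instead of exiting
                }
                return false; // fail this app only; the NM keeps running
            }
        }
        // Assumption: a success resets the counter and restores healthy status.
        consecutiveFailures = 0;
        healthy = true;
        return true;
    }

    synchronized boolean isHealthy() {
        return healthy;
    }
}
```

In this sketch the NM never exits because of the namenode; an unavailable namenode only fails individual apps and, after repeated failures, flips the health flag.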

I agree that if an NM loses its ability to connect to the namenode after an app has started, it
would be good for the NM to report that it wasn't able to write its logs, but my opinion
is that this is a more difficult issue and does not need to be tied to this change.
                
> Nodemanager fails to start if log aggregation enabled and namenode unavailable
> ------------------------------------------------------------------------------
>
>                 Key: YARN-24
>                 URL: https://issues.apache.org/jira/browse/YARN-24
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 0.23.3, 2.0.0-alpha
>            Reporter: Jason Lowe
>         Attachments: YARN-24.patch
>
>
> If log aggregation is enabled and the namenode is currently unavailable, the nodemanager fails to start up.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira