hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4216) Container logs not shown for newly assigned containers after NM recovery
Date Tue, 06 Oct 2015 13:28:27 GMT

    [ https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945020#comment-14945020 ]

Jason Lowe commented on YARN-4216:

If we're decommissioning a node then we're not doing a rolling upgrade of it.  Decomm of a
node should kill all of the containers on the node, upload the logs, then shut down the node.
That's not a rolling upgrade since we lose work.  It may be rolling in the sense that we
can go through the nodes in a serial fashion, but since work is being lost at each step it's
significantly different from the rolling upgrade with work-preserving restart.

What we're talking about here is reinsertion of a previously decomm'd node that ends up running
containers for an application that already had its logs aggregated, which is slightly different
from the JIRA title, which implies work-preserving restart.  Having the NM append the new logs
would be a reasonable approach to try to avoid log loss, although there's the problem of active
readers for the logs.  If we're appending then readers that come along to parse the logs can
see partially written logs at the end.  We'd either have to live with that possibility
or have the NM copy the existing logs to the .tmp file, append the new logs there, then
atomically replace the previous logs with the new version.  Not all filesystems support
atomic replace, but HDFS can do it.
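
A minimal sketch of the copy, append, atomic-replace sequence described above.  It is only
an illustration: it uses java.nio on a local filesystem as a stand-in rather than the Hadoop
FileSystem API, and the class and method names (LogAppendSketch, appendAtomically) are
hypothetical, not taken from the NodeManager code.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch; not the NodeManager's actual log-aggregation code.
class LogAppendSketch {

    // Appends newData to finalLog so that readers of finalLog see either the
    // old complete file or the new complete file, never a partial write.
    static void appendAtomically(Path finalLog, byte[] newData) throws IOException {
        // All writing happens in a sibling "<name>.tmp" file.
        Path tmp = finalLog.resolveSibling(finalLog.getFileName() + ".tmp");

        if (Files.exists(finalLog)) {
            // Start from a copy of the previously aggregated logs.
            Files.copy(finalLog, tmp, StandardCopyOption.REPLACE_EXISTING);
        } else {
            // No earlier upload: start from an empty file, discarding any stale .tmp.
            Files.deleteIfExists(tmp);
            Files.createFile(tmp);
        }

        // Append the new container logs to the private copy.
        Files.write(tmp, newData, StandardOpenOption.APPEND);

        // Swap the finished file into place. Whether an existing target is
        // replaced by ATOMIC_MOVE is implementation specific; POSIX rename does.
        Files.move(tmp, finalLog, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("aggregated-logs");
        Path log = dir.resolve("localhost_38153");

        // Logs uploaded when the NM was stopped.
        Files.write(log, "old container logs\n".getBytes(StandardCharsets.UTF_8));
        // Logs from containers assigned after the NM came back.
        appendAtomically(log, "new container logs\n".getBytes(StandardCharsets.UTF_8));

        System.out.print(new String(Files.readAllBytes(log), StandardCharsets.UTF_8));
    }
}
```

The cost of this approach is rewriting the whole existing file on every append, which is the
price paid for never exposing a partially written log to readers.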

> Container logs not shown for newly assigned containers after NM recovery
> ------------------------------------------------------------------------
>                 Key: YARN-4216
>                 URL: https://issues.apache.org/jira/browse/YARN-4216
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: log-aggregation, nodemanager
>            Reporter: Bibin A Chundatt
>            Assignee: Bibin A Chundatt
>            Priority: Critical
>         Attachments: NMLog, ScreenshotFolder.png, yarn-site.xml
> Steps to reproduce
> # Start 2 nodemanagers with NM recovery enabled
> # Submit pi job with 20 maps
> # Once 5 maps have completed on NM1, stop the NM (yarn daemon stop nodemanager)
> (Logs of all completed containers get aggregated to HDFS)
> # Now start NM1 again and wait for job completion
> *The newly assigned container logs on NM1 are not shown*
> *hdfs log dir state*
> # When logs are aggregated to HDFS during the stop, the file is named (localhost_38153)
> # On log aggregation after starting the NM, the newly assigned container logs get uploaded with the name (localhost_38153.tmp)
> In the history server the logs are not shown for new task attempts

This message was sent by Atlassian JIRA
