hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Varun Vasudev (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (YARN-4309) Add debug information to application logs when a container fails
Date Tue, 08 Dec 2015 15:47:11 GMT

     [ https://issues.apache.org/jira/browse/YARN-4309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Varun Vasudev updated YARN-4309:
    Attachment: YARN-4309.009.patch

Thanks for the reviews [~ivanmi] and [~leftnoteasy].

bq. I think it would be helpful to document what the methods are supposed to do.


bq. Do we want to remove the error check above, to be consistent with Linux, and to avoid
failing due to a "logging" failure? Also, cp command does not exist on Windows. Please use
copy instead. 


bq.  Why do you have both "dir" and "dir /AL /S" on Windows? Can you please include an inline
comment with rationale.

The original intent was to try to detect broken symlinks but I'm not sure if that's possible
using the dir command. I've removed the dir /AL /S command.

bq. In copyDebugInformation() you are also doing a chmod() internally. Wondering this this
command should be injected by the call site given that only the caller has context on what
the destination is and whether special permission handling is needed. It might be possible
to change the method to only accept src and copy the file to the current folder, in which
case it might be fine to use chmod() given that there is an assumption on what the current
folder is. Just a thought you make the call.

Good point. Copying the file to the current folder doesn't work because the container launch
script runs in the container work dir and we want these files to be uploaded as part of log
aggregation. I've just added a check to make sure the path is absolute before attempting the

bq. I meant to print comment to the generated container_launch.sh for better readability.
Such as:


> Add debug information to application logs when a container fails
> ----------------------------------------------------------------
>                 Key: YARN-4309
>                 URL: https://issues.apache.org/jira/browse/YARN-4309
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Varun Vasudev
>            Assignee: Varun Vasudev
>         Attachments: YARN-4309.001.patch, YARN-4309.002.patch, YARN-4309.003.patch, YARN-4309.004.patch,
YARN-4309.005.patch, YARN-4309.006.patch, YARN-4309.007.patch, YARN-4309.008.patch, YARN-4309.009.patch
> Sometimes when a container fails, it can be pretty hard to figure out why it failed.
> My proposal is that if a container fails, we collect information about the container
local dir and dump it into the container log dir. Ideally, I'd like to tar up the directory
entirely, but I'm not sure of the security and space implications of such a approach. At the
very least, we can list all the files in the container local dir, and dump the contents of
launch_container.sh(into the container log dir).
> When log aggregation occurs, all this information will automatically get collected and
make debugging such failures much easier.

This message was sent by Atlassian JIRA

View raw message