hadoop-yarn-issues mailing list archives

From "Varun Vasudev (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (YARN-4309) Add debug information to application logs when a container fails
Date Fri, 04 Dec 2015 14:07:11 GMT

     [ https://issues.apache.org/jira/browse/YARN-4309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Varun Vasudev updated YARN-4309:
    Attachment: YARN-4309.006.patch

Uploaded a new patch to address [~sidharta-s]'s comments.

[~leftnoteasy] - 
bq. Since debug information fetch script (like copy script and list files) is at the end of
launch_container.sh, is it possible that a container is killed so such script cannot be executed?

It's not at the end - it runs just before the actual container process is launched, so if we
reach the stage where launch_container.sh is called, it should almost always be run.
This is what the relevant lines from launch_container.sh look like with the patch:

echo "broken symlinks(find -L . -maxdepth 5 -type l -ls):" 1>>"/var/hadoop/hadoop-3-data/grid/log/application_1449046677123_0002/container_1449046677123_0002_01_000001/directory.info"
find -L . -maxdepth 5 -type l -ls 1>>"/var/hadoop/hadoop-3-data/grid/log/application_1449046677123_0002/container_1449046677123_0002_01_000001/directory.info"
exec /bin/bash -c "$JAVA_HOME/bin/java -Djava.io.tmpdir=$PWD/tmp -Dlog4j.configuration=container-log4j.properties
-Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog
 -Xmx1024m org.apache.hadoop.mapreduce.v2.app.MRAppMaster 1>/var/hadoop/hadoop-3-data/grid/log/application_1449046677123_0002/container_1449046677123_0002_01_000001/stdout
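The ordering described above can be tried out with a small stand-in script. This is only a sketch under assumptions: the temporary directories, the dangling.jar symlink and the echo that plays the role of the container process are all placeholders invented for illustration, not the real NodeManager layout or the code in the patch. It shows that the debug commands write to directory.info before exec replaces the shell with the final process.

```shell
#!/usr/bin/env bash
# Sketch only: every path and the final command are placeholders.
set -eu

work_dir="$(mktemp -d)"      # stands in for the container local dir
log_dir="$work_dir/logs"     # stands in for the container log dir
mkdir -p "$log_dir"

# A dangling symlink, so the broken-symlink listing has something to report.
ln -s "$work_dir/does-not-exist" "$work_dir/dangling.jar"

# A miniature launch script: debug commands first, then exec of the "container process".
# The heredoc is unquoted, so $log_dir is expanded now and the generated script
# contains absolute paths, similar to the excerpt above.
cat > "$work_dir/launch_container.sh" <<EOF
#!/bin/bash
echo "broken symlinks(find -L . -maxdepth 5 -type l -ls):" 1>>"$log_dir/directory.info"
find -L . -maxdepth 5 -type l -ls 1>>"$log_dir/directory.info"
exec /bin/bash -c "echo container process started 1>'$log_dir/stdout'"
EOF

# Run the script from the container directory, as a child process.
(cd "$work_dir" && bash ./launch_container.sh)

# Show what the debug commands recorded.
cat "$log_dir/directory.info"
```

With -L, find follows symlinks, so -type l only matches links whose target cannot be resolved; that is why the listing is labelled "broken symlinks".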

> Add debug information to application logs when a container fails
> ----------------------------------------------------------------
>                 Key: YARN-4309
>                 URL: https://issues.apache.org/jira/browse/YARN-4309
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Varun Vasudev
>            Assignee: Varun Vasudev
>         Attachments: YARN-4309.001.patch, YARN-4309.002.patch, YARN-4309.003.patch, YARN-4309.004.patch,
> YARN-4309.005.patch, YARN-4309.006.patch
> Sometimes when a container fails, it can be pretty hard to figure out why it failed.
> My proposal is that if a container fails, we collect information about the container
> local dir and dump it into the container log dir. Ideally, I'd like to tar up the directory
> entirely, but I'm not sure of the security and space implications of such an approach. At the
> very least, we can list all the files in the container local dir, and dump the contents of
> launch_container.sh (into the container log dir).
> When log aggregation occurs, all this information will automatically get collected and
> make debugging such failures much easier.
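As a rough illustration of the minimal variant of this proposal (list the files and keep a copy of launch_container.sh), the sketch below uses a hypothetical helper, collect_debug_info, with throwaway directories standing in for the container local dir and log dir. None of these names come from the patch, and the real change may collect the information differently.

```shell
#!/usr/bin/env bash
# Sketch only: collect_debug_info and both directory arguments are hypothetical.
set -eu

collect_debug_info() {
  local local_dir="$1" log_dir="$2"
  mkdir -p "$log_dir"
  # Keep a copy of the launch script next to the container logs.
  cp "$local_dir/launch_container.sh" "$log_dir/launch_container.sh"
  # Record a recursive listing of the container local dir.
  {
    echo "ls -lR:"
    (cd "$local_dir" && ls -lR .)
  } > "$log_dir/directory.info"
}

# Usage with throwaway directories standing in for the real ones.
demo_local="$(mktemp -d)"
demo_logs="$(mktemp -d)"
printf '#!/bin/bash\necho hello\n' > "$demo_local/launch_container.sh"
touch "$demo_local/job.jar"
collect_debug_info "$demo_local" "$demo_logs"
```

Because both files end up in the log dir, anything that later gathers that directory would pick them up alongside stdout and stderr without extra steps.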

This message was sent by Atlassian JIRA
