hadoop-yarn-issues mailing list archives

From "Varun Vasudev (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (YARN-4309) Add debug information to application logs when a container fails
Date Mon, 07 Dec 2015 17:25:11 GMT

     [ https://issues.apache.org/jira/browse/YARN-4309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Varun Vasudev updated YARN-4309:
--------------------------------
    Attachment: YARN-4309.008.patch

Uploaded a new patch to address comments by [~leftnoteasy].

bq. Could you make sure container process will be launched even if copy script or list folder command fails?

Fixed.
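
One way to read this fix, as a minimal sketch rather than the patch itself: the debug commands run without the exit-on-error check that guards the ln commands, so a failed copy or listing cannot stop the launch. The function name write_debug_info and the LOG_DIR variable are hypothetical and used only for illustration.

```shell
#!/bin/bash
# Hypothetical sketch, not the YARN-4309 patch: best-effort debug capture.
# write_debug_info and LOG_DIR are illustrative names.

write_debug_info() {
  local log_dir="$1"
  {
    # No exit-code checks here: any of these may fail without consequence.
    cp "launch_container.sh" "$log_dir/launch_container.sh"
    chmod 640 "$log_dir/launch_container.sh"
    echo "ls -l:" 1>"$log_dir/directory.info"
    ls -l 1>>"$log_dir/directory.info"
  } 2>/dev/null
  return 0  # always succeed so the caller goes on to launch the container
}

# Default to a directory that does not exist to exercise the failure path.
LOG_DIR="${LOG_DIR:-/nonexistent/log/dir}"
write_debug_info "$LOG_DIR"
echo "debug step returned $?; container launch would proceed here"
```

Swallowing stderr keeps the failure path quiet in this sketch; a real script might prefer to leave the error messages visible.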

bq. Could you add echo command (something like echo "Printing container launch debug info...") to container_launch.sh?
The echo will end up being captured by the ContainerExecutor and logged in the NM log. Any
particular reason you want to print this line?

bq. Add a test to verify log aggregation result contains such debugging output?
This would require essentially launching a container and waiting for log aggregation to occur.
I'm not sure it will add anything.

bq. Could you upload a sample container_launch.sh for easier review?

This is the output of the yarn logs command with the feature enabled:
{code}
LogType:launch_container.sh
Log Upload Time:Mon Dec 07 22:43:44 +0530 2015
LogLength:5042
Log Contents:
#!/bin/bash

export JAVA_HOME=${JAVA_HOME:-"/usr/lib/jvm/java-1.7.0-openjdk-amd64"}
export NM_AUX_SERVICE_mapreduce_shuffle="AAA0+gAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
"
export NM_HOST="ubuntu"
export HADOOP_YARN_HOME=${HADOOP_YARN_HOME:-"/var/hadoop/hadoop-3.0.0-SNAPSHOT"}
export HADOOP_ROOT_LOGGER="INFO,console"
export JVM_PID="$$"
export STDERR_LOGFILE_ENV="/var/hadoop/hadoop-3-data/grid/log/application_1449508378135_0001/container_1449508378135_0001_01_000003/stderr"
export PWD="/var/hadoop/hadoop-3-data/grid/local/usercache/varun/appcache/application_1449508378135_0001/container_1449508378135_0001_01_000003"
export NM_PORT="39813"
export LOGNAME="varun"
export MALLOC_ARENA_MAX="4"
export LD_LIBRARY_PATH="$PWD:/var/hadoop/hadoop-3.0.0-SNAPSHOT/lib/native"
export LOG_DIRS="/var/hadoop/hadoop-3-data/grid/log/application_1449508378135_0001/container_1449508378135_0001_01_000003,/var/hadoop/hadoop-3-data/grid2/log/application_1449508378135_0001/container_1449508378135_0001_01_000003"
export NM_HTTP_PORT="8042"
export SHELL="/bin/bash"
export LOCAL_DIRS="/var/hadoop/hadoop-3-data/grid/local/usercache/varun/appcache/application_1449508378135_0001"
export HADOOP_COMMON_HOME=${HADOOP_COMMON_HOME:-"/var/hadoop/hadoop-3.0.0-SNAPSHOT"}
export HADOOP_TOKEN_FILE_LOCATION="/var/hadoop/hadoop-3-data/grid/local/usercache/varun/appcache/application_1449508378135_0001/container_1449508378135_0001_01_000003/container_tokens"
export CLASSPATH="$PWD:$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/share/hadoop/common/*:$HADOOP_COMMON_HOME/share/hadoop/common/lib/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*:$HADOOP_YARN_HOME/share/hadoop/yarn/*:$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*:job.jar/job.jar:job.jar/classes/:job.jar/lib/*:$PWD/*"
export STDOUT_LOGFILE_ENV="/var/hadoop/hadoop-3-data/grid/log/application_1449508378135_0001/container_1449508378135_0001_01_000003/stdout"
export USER="varun"
export HADOOP_CLIENT_OPTS="-Xmx512m -Xmx512m  "
export HADOOP_HDFS_HOME=${HADOOP_HDFS_HOME:-"/var/hadoop/hadoop-3.0.0-SNAPSHOT"}
export CONTAINER_ID="container_1449508378135_0001_01_000003"
export HOME="/home/"
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/var/hadoop/hadoop-3.0.0-SNAPSHOT/conf"}
ln -sf "/var/hadoop/hadoop-3-data/grid/local/usercache/varun/appcache/application_1449508378135_0001/filecache/10/job.jar" "job.jar"
hadoop_shell_errorcode=$?
if [ $hadoop_shell_errorcode -ne 0 ]
then
  exit $hadoop_shell_errorcode
fi
ln -sf "/var/hadoop/hadoop-3-data/grid/local/usercache/varun/appcache/application_1449508378135_0001/filecache/13/job.xml" "job.xml"
hadoop_shell_errorcode=$?
if [ $hadoop_shell_errorcode -ne 0 ]
then
  exit $hadoop_shell_errorcode
fi
cp "launch_container.sh" "/var/hadoop/hadoop-3-data/grid/log/application_1449508378135_0001/container_1449508378135_0001_01_000003/launch_container.sh"
chmod 640 "/var/hadoop/hadoop-3-data/grid/log/application_1449508378135_0001/container_1449508378135_0001_01_000003/launch_container.sh"
echo "ls -l:" 1>"/var/hadoop/hadoop-3-data/grid/log/application_1449508378135_0001/container_1449508378135_0001_01_000003/directory.info"
ls -l 1>>"/var/hadoop/hadoop-3-data/grid/log/application_1449508378135_0001/container_1449508378135_0001_01_000003/directory.info"
echo "find -L . -maxdepth 5 -ls:" 1>>"/var/hadoop/hadoop-3-data/grid/log/application_1449508378135_0001/container_1449508378135_0001_01_000003/directory.info"
find -L . -maxdepth 5 -ls 1>>"/var/hadoop/hadoop-3-data/grid/log/application_1449508378135_0001/container_1449508378135_0001_01_000003/directory.info"
echo "broken symlinks(find -L . -maxdepth 5 -type l -ls):" 1>>"/var/hadoop/hadoop-3-data/grid/log/application_1449508378135_0001/container_1449508378135_0001_01_000003/directory.info"
find -L . -maxdepth 5 -type l -ls 1>>"/var/hadoop/hadoop-3-data/grid/log/application_1449508378135_0001/container_1449508378135_0001_01_000003/directory.info"
exec /bin/bash -c "$JAVA_HOME/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN  -Xmx820m -Djava.io.tmpdir=$PWD/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/var/hadoop/hadoop-3-data/grid/log/application_1449508378135_0001/container_1449508378135_0001_01_000003 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog -Dyarn.app.mapreduce.shuffle.logger=INFO,shuffleCLA -Dyarn.app.mapreduce.shuffle.logfile=syslog.shuffle -Dyarn.app.mapreduce.shuffle.log.filesize=0 -Dyarn.app.mapreduce.shuffle.log.backups=0 org.apache.hadoop.mapred.YarnChild 127.0.1.1 36966 attempt_1449508378135_0001_r_000000_0 3 1>/var/hadoop/hadoop-3-data/grid/log/application_1449508378135_0001/container_1449508378135_0001_01_000003/stdout 2>/var/hadoop/hadoop-3-data/grid/log/application_1449508378135_0001/container_1449508378135_0001_01_000003/stderr "
hadoop_shell_errorcode=$?
if [ $hadoop_shell_errorcode -ne 0 ]
then
  exit $hadoop_shell_errorcode
fi
End of LogType:launch_container.sh
{code}
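
To pull a single file such as this one out of aggregated output, a small filter can key on the LogType: and End of LogType: marker lines shown above. This is a sketch: extract_log_type is a hypothetical helper, and it assumes every log type is delimited by the same two markers as launch_container.sh.

```shell
#!/bin/bash
# Hypothetical helper: print one log type's section from aggregated log
# text read on stdin. Assumes "LogType:<name>" / "End of LogType:<name>"
# delimit each section, as in the sample output above.
extract_log_type() {
  awk -v name="$1" '
    $0 == ("LogType:" name)        { printing = 1 }
    printing                       { print }
    $0 == ("End of LogType:" name) { printing = 0 }
  '
}

# Small stand-in for real aggregated output (contents are made up).
sample='LogType:directory.info
ls -l:
total 0
End of LogType:directory.info
LogType:stderr
some error text
End of LogType:stderr'

printf '%s\n' "$sample" | extract_log_type directory.info

# On a cluster, the input would come from the yarn logs command, e.g.:
#   yarn logs -applicationId application_1449508378135_0001 | extract_log_type directory.info
```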



> Add debug information to application logs when a container fails
> ----------------------------------------------------------------
>
>                 Key: YARN-4309
>                 URL: https://issues.apache.org/jira/browse/YARN-4309
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Varun Vasudev
>            Assignee: Varun Vasudev
>         Attachments: YARN-4309.001.patch, YARN-4309.002.patch, YARN-4309.003.patch, YARN-4309.004.patch, YARN-4309.005.patch, YARN-4309.006.patch, YARN-4309.007.patch, YARN-4309.008.patch
>
>
> Sometimes when a container fails, it can be pretty hard to figure out why it failed.
> My proposal is that if a container fails, we collect information about the container
> local dir and dump it into the container log dir. Ideally, I'd like to tar up the directory
> entirely, but I'm not sure of the security and space implications of such an approach. At the
> very least, we can list all the files in the container local dir and dump the contents of
> launch_container.sh into the container log dir.
> When log aggregation occurs, all this information will automatically get collected and
> make debugging such failures much easier.
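
The file listing in the sample script relies on a find idiom worth spelling out: with -L, find follows symlinks, so -type l can only match links whose target is missing. A small self-contained demonstration, using a throwaway temp directory and made-up file names:

```shell
#!/bin/bash
# Illustrative only: file names are made up; runs in a throwaway directory.
demo_dir="$(mktemp -d)"
cd "$demo_dir" || exit 1

touch real_file
ln -s real_file good_link        # target exists
ln -s missing_file broken_link   # target does not exist

# With -L, symlinks are followed; -type l matches only unresolvable links.
broken="$(find -L . -maxdepth 5 -type l)"
echo "$broken"
```

Only broken_link is reported, which is why the script labels that section "broken symlinks".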



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)