hadoop-yarn-dev mailing list archives

From Daniel Templeton <dan...@cloudera.com>
Subject Re: Hadoop3.0.0-alpha2-Docker run time launch errors.
Date Wed, 24 May 2017 22:32:57 GMT
There is indeed an issue:

Container id: container_1495657784956_0007_01_000002
Exit code: 7
Exception message: docker: Error response from daemon: oci runtime
error: container_linux.go:247: starting container process caused "exec:
\"bash /tmp/hadoop-root/nm-local-dir/usercache/daniel/appcache/application_1495657784956_0007/container_1495657784956_0007_01_000002/launch_container.sh\":
stat bash /tmp/hadoop-root/nm-local-dir/usercache/daniel/appcache/application_1495657784956_0007/container_1495657784956_0007_01_000002/launch_container.sh:
no such file or directory".
Could not invoke docker docker run
--name='container_1495657784956_0007_01_000002' --user='daniel' -d
--workdir='/tmp/hadoop-root/nm-local-dir/usercache/daniel/appcache/application_1495657784956_0007/container_1495657784956_0007_01_000002'
--net='host' --cap-drop='ALL' --cap-add='SYS_CHROOT' --cap-add='MKNOD'
--cap-add='SETFCAP' --cap-add='SETPCAP' --cap-add='FSETID'
--cap-add='CHOWN' --cap-add='AUDIT_WRITE' --cap-add='SETGID'
--cap-add='NET_RAW' --cap-add='FOWNER' --cap-add='SETUID'
--cap-add='DAC_OVERRIDE' --cap-add='KILL' --cap-add='NET_BIND_SERVICE'
-v '/tmp/hadoop-root/nm-local-dir/usercache/daniel/appcache/application_1495657784956_0007/:/tmp/hadoop-root/nm-local-dir/usercache/daniel/appcache/application_1495657784956_0007/'
-v '/tmp/hadoop-root/nm-local-dir/filecache:/tmp/hadoop-root/nm-local-dir/filecache'
-v '/tmp/hadoop-root/nm-local-dir/usercache/daniel/appcache/application_1495657784956_0007/container_1495657784956_0007_01_000002:/tmp/hadoop-root/nm-local-dir/usercache/daniel/appcache/application_1495657784956_0007/container_1495657784956_0007_01_000002'
-v '/tmp/logs/application_1495657784956_0007/container_1495657784956_0007_01_000002:/tmp/logs/application_1495657784956_0007/container_1495657784956_0007_01_000002'
-v '/tmp/hadoop-root/nm-local-dir/usercache/daniel/:/tmp/hadoop-root/nm-local-dir/usercache/daniel/'
'daniel' 'bash /tmp/hadoop-root/nm-local-dir/usercache/daniel/appcache/application_1495657784956_0007/container_1495657784956_0007_01_000002/launch_container.sh'.

The problem is that the last argument should be broken up into 3
separate arguments.  Docker is trying to take that whole last arg as a
path to the thing to run, and it can't find it.  Dropping the quotes
around the last arg in the docker command lets it run.  This is a new issue.
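
For anyone reproducing this, here is a minimal illustration of the
quoting problem, using a made-up image and script (busybox and
/tmp/hello.sh):

  # Quoted as one argument, Docker treats the whole string as a single
  # executable path, and exec fails with the same stat error as above:
  docker run busybox 'sh /tmp/hello.sh'

  # Split into separate arguments, it runs:
  docker run -v /tmp:/tmp busybox sh /tmp/hello.sh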

I see in the email you just sent that switching to the root user
resolves the issue for you.  It does not for me.  Are you sure it does?

Daniel


On 5/24/17 12:35 PM, Jasson Chenwei wrote:
> hi, Daniel
>
> Thanks for your kind help. Non-docker mode works for me.
>
> Here is my configuration related to the docker setup:
> (1) yarn-site.xml
> <property>
>   <name>yarn.nodemanager.container-executor.class</name>
>   <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
> </property>
>
> <property>
>   <name>yarn.nodemanager.linux-container-executor.group</name>
>   <value>cwei</value>
> </property>
>
> <property>
>   <name>yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users</name>
>   <value>false</value>
> </property>
>
> (2) /etc/yarn/container-executor.cfg (I built the source code to
> designate this location):
> yarn.nodemanager.linux-container-executor.group=cwei
> allowed.system.users=cwei,root
> feature.docker.enabled=1
>
> (3) container-executor:
> ---Sr-s--- 1 root cwei 215K May 17 15:00 container-executor
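>
> For reference, these permissions match the usual setup for the setuid
> binary; a sketch, assuming container-executor sits under
> $HADOOP_PREFIX/bin:
>
>   # owned by root, group set to the NM group, mode 6050 => ---Sr-s---
>   chown root:cwei $HADOOP_PREFIX/bin/container-executor
>   chmod 6050 $HADOOP_PREFIX/bin/container-executor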
> (4) Docker image Dockerfile (I modified it based on
> sequenceiq/hadoop:latest, which uses hadoop-2.7.1 as its code base).
> The major modifications are: (1) add user cwei to match the
> linux-container-executor group, (2) download hadoop-3.0.0-alpha2 as
> the code base, and (3) set the ENV.  I also tested this image on my
> machine with a bare bash. The values of JAVA_HOME/HADOOP_PREFIX are
> correct.
>
> FROM sequenceiq/pam:centos-6.5
> MAINTAINER cwei
>
> USER root
>
> RUN useradd -ms /bin/bash cwei
>
> # install dev tools
> RUN yum clean all; \
>     rpm --rebuilddb; \
>     yum install -y curl which tar sudo openssh-server openssh-clients rsync
> # update libselinux. see https://github.com/sequenceiq/hadoop-docker/issues/14
> RUN yum update -y libselinux
>
> # passwordless ssh
> RUN ssh-keygen -q -N "" -t dsa -f /etc/ssh/ssh_host_dsa_key
> RUN ssh-keygen -q -N "" -t rsa -f /etc/ssh/ssh_host_rsa_key
> RUN ssh-keygen -q -N "" -t rsa -f /root/.ssh/id_rsa
> RUN cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys
>
> # java
> RUN curl -LO 'http://download.oracle.com/otn-pub/java/jdk/8u131-b11/d54c1d3a095b4ff2b6607d096fa80163/jdk-8u131-linux-x64.rpm' -H 'Cookie: oraclelicense=accept-securebackup-cookie'
> RUN rpm -i jdk-8u131-linux-x64.rpm
> RUN rm jdk-8u131-linux-x64.rpm
>
> ENV JAVA_HOME /usr/java/default
> ENV PATH $PATH:$JAVA_HOME/bin
> RUN rm /usr/bin/java && ln -s $JAVA_HOME/bin/java /usr/bin/java
>
> # download native support
> RUN mkdir -p /tmp/native
> RUN curl -L https://github.com/sequenceiq/docker-hadoop-build/releases/download/v2.7.1/hadoop-native-64-2.7.1.tgz | tar -xz -C /tmp/native
>
> # hadoop
> RUN curl -s http://www.eu.apache.org/dist/hadoop/common/hadoop-3.0.0-alpha2/hadoop-3.0.0-alpha2.tar.gz | tar -xz -C /usr/local/
> RUN cd /usr/local && ln -s ./hadoop-3.0.0-alpha2 hadoop
>
> ENV HADOOP_PREFIX /usr/local/hadoop
> ENV HADOOP_COMMON_HOME /usr/local/hadoop
> ENV HADOOP_HDFS_HOME /usr/local/hadoop
> ENV HADOOP_MAPRED_HOME /usr/local/hadoop
> ENV HADOOP_YARN_HOME /usr/local/hadoop
> ENV HADOOP_CONF_DIR /usr/local/hadoop/etc/hadoop
> ENV YARN_CONF_DIR $HADOOP_PREFIX/etc/hadoop
> (5) job submission:
> /home/cwei/project/hadoop-3.0.0-alpha2/bin/hadoop --config
> /home/cwei/project/hadoop-3.0.0-alpha2/etc/hadoop jar
> /home/cwei/project/hadoop-3.0.0-alpha2/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-alpha3-SNAPSHOT.jar
> randomtextwriter -D mapreduce.randomtextwriter.totalbytes=3200000000
> -D mapreduce.randomtextwriter.bytespermap=266666666
> -Dmapreduce.map.env="YARN_CONTAINER_RUNTIME_TYPE=docker,YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=cwei/hadoop:3.0.0"
> -Dyarn.app.mapreduce.am.env="YARN_CONTAINER_RUNTIME_TYPE=docker,YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=cwei/hadoop:3.0.0"
> -Dmapreduce.reduce.env="YARN_CONTAINER_RUNTIME_TYPE=docker,YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=cwei/hadoop:3.0.0"
> -D mapreduce.job.maps=12 -D mapreduce.job.reduces=12
> hdfs://disco-0021:9000/HiBench/Wordcount/Input
>
>
> (6) The output log for this mapreduce job:
>
> Job started: Tue May 23 15:53:58 MDT 2017
> 2017-05-23 15:53:58,861 INFO client.RMProxy: Connecting to ResourceManager at disco-0021/128.198.180.88:8032
> 2017-05-23 15:54:00,807 INFO mapreduce.JobSubmitter: number of splits:12
> 2017-05-23 15:54:01,384 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1495576396036_0001
> 2017-05-23 15:54:02,117 INFO impl.YarnClientImpl: Submitted application application_1495576396036_0001
> 2017-05-23 15:54:02,182 INFO mapreduce.Job: The url to track the job: http://disco-0021:8088/proxy/application_1495576396036_0001/
> 2017-05-23 15:54:02,184 INFO mapreduce.Job: Running job: job_1495576396036_0001
> 2017-05-23 15:54:12,300 INFO mapreduce.Job: Job job_1495576396036_0001 running in uber mode : false
> 2017-05-23 15:54:12,302 INFO mapreduce.Job:  map 0% reduce 0%
> 2017-05-23 15:54:12,321 INFO mapreduce.Job: Job job_1495576396036_0001 failed with state FAILED due to: Application application_1495576396036_0001 failed 2 times due to AM Container for appattempt_1495576396036_0001_000002 exited with exitCode: -1
> Failing this attempt.Diagnostics: For more detailed output, check the application tracking page: http://disco-0021:8088/cluster/app/application_1495576396036_0001 Then click on links to logs of each attempt.
> . Failing the application.
> 2017-05-23 15:54:12,349 INFO mapreduce.Job: Counters: 0
> Job ended: Tue May 23 15:54:12 MDT 2017
> The job took 13 seconds.
>
>
>
> (7) I checked the Docker-launched containers with the command
> docker ps -a. I found that the container assigned to the job's
> AppMaster only ran for 5 seconds. This means container-executor
> successfully passed the privilege check and launched the container,
> but the docker command itself failed.
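>
> To dig into why the container exits so quickly, a quick sketch using
> the container name from the docker command in (8) below:
>
>   # list all containers, including exited ones, then read the exit
>   # code and logs of the failed AM container:
>   docker ps -a
>   docker inspect --format '{{.State.ExitCode}}' container_1495576396036_0001_01_000001
>   docker logs container_1495576396036_0001_01_000001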
>
>
> (8) The docker command can be found at yarn-temp/nm-docker-cmds;
> clearly its workdir is
> /home/cwei/project/hadoop-3.0.0-alpha2/yarn-temp/nm-local-dir/usercache/cwei/appcache/application_1495576396036_0001/container_1495576396036_0001_01_000001.
> Thus, launch_container.sh should be in this folder so that the bash
> command wrapped in the docker run command can locate it.
>
> run --name=container_1495576396036_0001_01_000001 --user=cwei -d
> --workdir=/home/cwei/project/hadoop-3.0.0-alpha2/yarn-temp/nm-local-dir/usercache/cwei/appcache/application_1495576396036_0001/container_1495576396036_0001_01_000001
> --net=host --cap-drop=ALL --cap-add=SYS_CHROOT --cap-add=MKNOD
> --cap-add=SETFCAP --cap-add=SETPCAP --cap-add=FSETID --cap-add=CHOWN
> --cap-add=AUDIT_WRITE --cap-add=SETGID --cap-add=NET_RAW
> --cap-add=FOWNER --cap-add=SETUID --cap-add=DAC_OVERRIDE
> --cap-add=KILL --cap-add=NET_BIND_SERVICE
> -v /sys/fs/cgroup:/sys/fs/cgroup:ro
> -v /home/cwei/project/hadoop-3.0.0-alpha2/yarn-temp/nm-local-dir/usercache/cwei/appcache/application_1495576396036_0001/:/home/cwei/project/hadoop-3.0.0-alpha2/yarn-temp/nm-local-dir/usercache/cwei/appcache/application_1495576396036_0001/
> -v /home/cwei/project/hadoop-3.0.0-alpha2/yarn-temp/nm-local-dir/filecache:/home/cwei/project/hadoop-3.0.0-alpha2/yarn-temp/nm-local-dir/filecache
> -v /home/cwei/project/hadoop-3.0.0-alpha2/yarn-temp/nm-local-dir/usercache/cwei/appcache/application_1495576396036_0001/container_1495576396036_0001_01_000001:/home/cwei/project/hadoop-3.0.0-alpha2/yarn-temp/nm-local-dir/usercache/cwei/appcache/application_1495576396036_0001/container_1495576396036_0001_01_000001
> -v /home/cwei/project/hadoop-3.0.0-alpha2/logs/userlogs/application_1495576396036_0001/container_1495576396036_0001_01_000001:/home/cwei/project/hadoop-3.0.0-alpha2/logs/userlogs/application_1495576396036_0001/container_1495576396036_0001_01_000001
> -v /home/cwei/project/hadoop-3.0.0-alpha2/yarn-temp/nm-local-dir/usercache/cwei/:/home/cwei/project/hadoop-3.0.0-alpha2/yarn-temp/nm-local-dir/usercache/cwei/
> cwei/hadoop:3.0.0 bash
> /home/cwei/project/hadoop-3.0.0-alpha2/yarn-temp/nm-local-dir/usercache/cwei/appcache/application_1495576396036_0001/container_1495576396036_0001_01_000001/launch_container.sh
>
>
> (9) However, after checking the source code and adding some log
> statements, I found that:
> 1. The container work dir is
> /home/cwei/project/hadoop-3.0.0-alpha2/yarn-temp/nm-local-dir/usercache/cwei/appcache/application_1495576396036_0001/container_1495576396036_0001_01_000001,
> which corresponds to the --workdir parameter in the docker command.
> 2. But ContainerLaunch writes out launch_container.sh to
> /home/cwei/project/hadoop-3.0.0-alpha2/yarn-temp/nm-local-dir/nmPrivate/application_1495576396036_0001/container_1495576396036_0001_01_000001/launch_container.sh.
> As a result, the docker run command cannot locate this launch script
> and ends up with errors.
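>
> This can be checked from the shell while the container is launching;
> a sketch using the two paths above:
>
>   # present under nmPrivate:
>   ls /home/cwei/project/hadoop-3.0.0-alpha2/yarn-temp/nm-local-dir/nmPrivate/application_1495576396036_0001/container_1495576396036_0001_01_000001/launch_container.sh
>   # but absent under the docker --workdir:
>   ls /home/cwei/project/hadoop-3.0.0-alpha2/yarn-temp/nm-local-dir/usercache/cwei/appcache/application_1495576396036_0001/container_1495576396036_0001_01_000001/launch_container.sh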
>
>
>
>
> (10) I also added Thread.sleep() calls during container launching to
> verify the above.
>
>
> If you need extra info, please let me know.
>
>
>
>
> Thanks,
>
>
>
> Wei
>
> On Wed, May 24, 2017 at 12:12 PM, Daniel Templeton
> <daniel@cloudera.com> wrote:
>
>     I'm firing up a quick cluster to test the latest trunk.  I'll let
>     you know if I have any issues.
>
>
>     Can you give more details about the cluster config?  Is this an
>     existing cluster where you're turning on Docker support or a new
>     cluster?  Do non-Docker workloads launch correctly?
>
>
>     Daniel
>
>
>     On 5/24/17 11:00 AM, Jasson Chenwei wrote:
>
>         hi, all
>
>         I have problems launching Docker containers in
>         Hadoop 3.0.0-alpha2/3.
>
>         I found that applications fail to start while initializing
>         the Docker container. This is caused by Docker not being able
>         to find launch_container.sh in its workdir.
>
>         Here is my log:
>
>         2017-05-24 11:03:09,662 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: private script path: /home/cwei/project/hadoop-3.0.0-alpha2/yarn-temp/nm-local-dir/nmPrivate/application_1495644587391_0001/container_1495644587391_0001_02_000001/launch_container.sh
>         2017-05-24 11:03:09,695 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: private token path: /home/cwei/project/hadoop-3.0.0-alpha2/yarn-temp/nm-local-dir/nmPrivate/application_1495644587391_0001/container_1495644587391_0001_02_000001/container_1495644587391_0001_02_000001.tokens
>         2017-05-24 11:03:09,837 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: private jar path: /home/cwei/project/hadoop-3.0.0-alpha2/yarn-temp/nm-local-dir/nmPrivate/application_1495644587391_0001/container_1495644587391_0001_02_000001
>         2017-05-24 11:03:09,876 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1495644587391_0001_02_000001 transitioned from SCHEDULED to RUNNING
>         2017-05-24 11:03:09,876 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Starting resource-monitoring for container_1495644587391_0001_02_000001
>         2017-05-24 11:03:09,876 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime: container working dir: /home/cwei/project/hadoop-3.0.0-alpha2/yarn-temp/nm-local-dir/usercache/cwei/appcache/application_1495644587391_0001/container_1495644587391_0001_02_000001
>
>         ContainerLaunch writes the launch script as well as the token
>         file to /yarn-temp/nm-local-dir/nmPrivate/application_id/container_id.
>
>         However, the Docker working dir is
>         /yarn-temp/nm-local-dir/usercache/user/appcache/application_id/container_id.
>         Because of this mismatch, Docker cannot locate the launch script.
>
>
>         I found that the launch script is initially written to a
>         temporary directory and should finally be copied to the
>         workdir. This logic is implemented in DefaultContainerExecutor
>         (roughly as sketched below), but I could not find why it is
>         not implemented in LinuxContainerExecutor.
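>
>         As an illustration, the relocation DefaultContainerExecutor
>         performs amounts roughly to this shell step (a sketch of the
>         expected behavior, not the actual Java code; path placeholders
>         as above):
>
>           # copy the script from the NM-private dir into the container
>           # workdir that docker later uses as --workdir
>           cp /yarn-temp/nm-local-dir/nmPrivate/application_id/container_id/launch_container.sh \
>              /yarn-temp/nm-local-dir/usercache/user/appcache/application_id/container_id/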
>
>
>         I am not sure if this is a configuration issue. I would
>         appreciate help from anyone who has tried the Docker container
>         runtime on Hadoop-3.0.0-alpha2/3.
>
>
>
>         Wei
>
>
>
>
>

