aurora-issues mailing list archives

From "Jie Yu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AURORA-1763) GPU drivers are missing when using a Docker image
Date Tue, 06 Sep 2016 18:48:20 GMT

    [ https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15468190#comment-15468190
] 

Jie Yu commented on AURORA-1763:
--------------------------------

Setting a rootfs for the executor is another option, but I think that might break Thermos because
it assumes it can see the host rootfs (I might be wrong). Also, bundling the executor (and libmesos.so)
in an image is not trivial because of ABI compatibility issues. It would mean Thermos needs
one Docker image for each Linux distribution.

> GPU drivers are missing when using a Docker image
> -------------------------------------------------
>
>                 Key: AURORA-1763
>                 URL: https://issues.apache.org/jira/browse/AURORA-1763
>             Project: Aurora
>          Issue Type: Bug
>          Components: Executor
>    Affects Versions: 0.16.0
>            Reporter: Justin Pinkul
>
> When launching a GPU job that uses a Docker image and the unified containerizer the Nvidia
> drivers are not correctly mounted. As an experiment I launched a task using both mesos-execute
> and Aurora using the same Docker image and ran nvidia-smi. During the experiment I noticed
> that the /usr/local/nvidia folder was not being mounted properly. To confirm this was the
> issue I tar'ed the drivers up (/run/mesos/isolators/gpu/nvidia_352.39) and manually added
> it to the Docker image. When this was done the task was able to launch correctly.
> Here is the resulting mountinfo for the mesos-execute task. Notice how /usr/local/nvidia
> is mounted from the /mesos directory.
> {noformat}140 102 8:17 /mesos_work/provisioner/containers/11c497a2-a300-4c9e-a474-79aad1f28f11/backends/copy/rootfses/8ee046a6-bacb-42ff-b039-2cabda5d0e62 / rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
> 141 140 8:17 /mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/67025326-9dfd-4cbb-a008-454a40bce2f5-0009/executors/gpu-test/runs/11c497a2-a300-4c9e-a474-79aad1f28f11 /mnt/mesos/sandbox rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
> 142 140 0:15 /mesos/isolators/gpu/nvidia_352.39 /usr/local/nvidia rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
> 143 140 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
> 144 143 0:3 /sys /proc/sys ro,relatime - proc proc rw
> 145 140 0:14 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs rw
> 146 140 0:38 / /dev rw,nosuid - tmpfs tmpfs rw,mode=755
> 147 146 0:39 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts rw,mode=600,ptmxmode=666
> 148 146 0:40 / /dev/shm rw,nosuid,nodev - tmpfs tmpfs rw{noformat}
> Here is the mountinfo when using Aurora. Notice how /usr/local/nvidia is missing.
> {noformat}72 71 8:1 / / rw,relatime master:1 - ext4 /dev/sda1 rw,errors=remount-ro,data=ordered
> 73 72 0:5 / /dev rw,relatime master:2 - devtmpfs udev rw,size=10240k,nr_inodes=16521649,mode=755
> 74 73 0:11 / /dev/pts rw,nosuid,noexec,relatime master:3 - devpts devpts rw,gid=5,mode=620,ptmxmode=000
> 75 73 0:17 / /dev/shm rw,nosuid,nodev master:4 - tmpfs tmpfs rw
> 76 73 0:13 / /dev/mqueue rw,relatime master:21 - mqueue mqueue rw
> 77 73 0:30 / /dev/hugepages rw,relatime master:23 - hugetlbfs hugetlbfs rw
> 78 72 0:15 / /run rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
> 79 78 0:18 / /run/lock rw,nosuid,nodev,noexec,relatime master:6 - tmpfs tmpfs rw,size=5120k
> 80 78 0:32 / /run/rpc_pipefs rw,relatime master:25 - rpc_pipefs rpc_pipefs rw
> 82 72 0:14 / /sys rw,nosuid,nodev,noexec,relatime master:7 - sysfs sysfs rw
> 83 82 0:16 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime master:8 - securityfs securityfs rw
> 84 82 0:19 / /sys/fs/cgroup ro,nosuid,nodev,noexec master:9 - tmpfs tmpfs ro,mode=755
> 85 84 0:20 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:10 - cgroup cgroup rw,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd
> 86 84 0:22 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime master:13 - cgroup cgroup rw,cpuset
> 87 84 0:23 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime master:14 - cgroup cgroup rw,cpu,cpuacct
> 88 84 0:24 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime master:15 - cgroup cgroup rw,devices
> 89 84 0:25 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime master:16 - cgroup cgroup rw,freezer
> 90 84 0:26 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime master:17 - cgroup cgroup rw,net_cls,net_prio
> 91 84 0:27 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime master:18 - cgroup cgroup rw,blkio
> 92 84 0:28 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime master:19 - cgroup cgroup rw,perf_event
> 93 82 0:21 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime master:11 - pstore pstore rw
> 94 82 0:6 / /sys/kernel/debug rw,relatime master:22 - debugfs debugfs rw
> 95 72 0:3 / /proc rw,nosuid,nodev,noexec,relatime master:12 - proc proc rw
> 96 95 0:29 / /proc/sys/fs/binfmt_misc rw,relatime master:20 - autofs systemd-1 rw,fd=22,pgrp=1,timeout=300,minproto=5,maxproto=5,direct
> 97 96 0:34 / /proc/sys/fs/binfmt_misc rw,relatime master:27 - binfmt_misc binfmt_misc rw
> 98 72 8:17 / /mnt/01 rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
> 99 98 8:17 /mesos_work/provisioner/containers/3790dd16-d1e2-4974-ba21-095a029b8c7d/backends/copy/rootfses/7ce26962-10a7-40ec-843b-c76e7e29c88d /mnt/01/mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/13e02526-f2b7-4677-bb23-0faeeac65be9-0000/executors/thermos-root-devel-gpu_test-0-beeb742b-28c1-46f3-b49f-23443b6efcc2/runs/3790dd16-d1e2-4974-ba21-095a029b8c7d/taskfs rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
> 100 99 8:17 /mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/13e02526-f2b7-4677-bb23-0faeeac65be9-0000/executors/thermos-root-devel-gpu_test-0-beeb742b-28c1-46f3-b49f-23443b6efcc2/runs/3790dd16-d1e2-4974-ba21-095a029b8c7d/sandbox /mnt/01/mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/13e02526-f2b7-4677-bb23-0faeeac65be9-0000/executors/thermos-root-devel-gpu_test-0-beeb742b-28c1-46f3-b49f-23443b6efcc2/runs/3790dd16-d1e2-4974-ba21-095a029b8c7d/taskfs/mnt/mesos/sandbox rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
> 67 78 0:33 / /run/user/1001 rw,nosuid,nodev,relatime master:26 - tmpfs tmpfs rw,size=13219080k,mode=700,uid=1001,gid=1001{noformat}
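
The comparison of the two mountinfo listings above can be repeated from inside a task with a small script. What follows is only a sketch, assuming the standard /proc/[pid]/mountinfo field layout; the names Mount, parse_mountinfo and find_mount are hypothetical helpers made up for illustration and are not part of Aurora, Thermos or Mesos. The sample text is taken from the mesos-execute listing in the issue description.

```python
import os
from typing import List, NamedTuple, Optional


class Mount(NamedTuple):
    """One mountinfo entry (hypothetical helper type, illustration only)."""
    mount_id: int
    parent_id: int
    root: str         # path inside the source filesystem that is mounted
    mount_point: str  # where it appears in this mount namespace
    fs_type: str


def parse_mountinfo(text: str) -> List[Mount]:
    """Parse mountinfo text into Mount records, skipping malformed lines."""
    mounts = []
    for line in text.splitlines():
        # A " - " separator ends the variable-length optional fields.
        left, sep, right = line.strip().partition(" - ")
        pre, post = left.split(), right.split()
        if not sep or len(pre) < 6 or not post:
            continue
        mounts.append(Mount(int(pre[0]), int(pre[1]), pre[3], pre[4], post[0]))
    return mounts


def find_mount(mounts: List[Mount], mount_point: str) -> Optional[Mount]:
    """Return the first entry mounted exactly at mount_point, or None."""
    for m in mounts:
        if m.mount_point == mount_point:
            return m
    return None


# Two lines copied from the mesos-execute listing in the issue.
SAMPLE = """\
142 140 0:15 /mesos/isolators/gpu/nvidia_352.39 /usr/local/nvidia rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
143 140 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
"""

if __name__ == "__main__":
    path = "/proc/self/mountinfo"
    if os.path.exists(path):
        # Inside a running task, inspect the task's own mount namespace.
        with open(path) as f:
            text = f.read()
    else:
        # Fall back to the sample when /proc is not available.
        text = SAMPLE
    hit = find_mount(parse_mountinfo(text), "/usr/local/nvidia")
    if hit is None:
        print("/usr/local/nvidia is NOT mounted")
    else:
        print("/usr/local/nvidia is mounted from", hit.root)
```

Run as the task's command in both the mesos-execute and the Aurora case, this would report whether the driver volume is visible in the task's mount namespace without having to read the full listing by eye.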



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
