mesos-dev mailing list archives

From Justin Pinkul <>
Subject GPU containerizer post mortem
Date Mon, 12 Sep 2016 23:48:56 GMT
Hi everyone,
I just ran into a very subtle configuration problem with enabling GPU support on Mesos and
thought I'd share a brief post mortem.
Scenario: Running a GPU Mesos task. The task first executes nvidia-smi to confirm the GPUs
are visible, then executes a Caffe training example to verify the GPU is usable.
Symptom: nvidia-smi reported the correct number of GPUs, but the training example crashed
when creating the CUDA device.
Debugging tactics: To debug this I added an infinite loop to the end of the task so the environment
would not be torn down. Next, I logged into the machine, found the PID of the Mesos task, and
entered its namespaces with: nsenter -t $TASK_PID -m -u -i -n -p -r -w

At this point I attempted to manually run the test, and it worked. The reason it worked was
that my test terminal had not been added to the devices cgroup. So next I added it to the cgroup
with: echo $TEST_TERMINAL_PID >> /sys/fs/cgroup/devices/mesos/f6736041-d403-4494-95fd-604eace34ce1/tasks
After joining the cgroup I could reproduce the problem, and I systematically added devices to
the cgroup's allow list until it worked.
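For anyone unfamiliar with the cgroup-v1 devices controller used here, a rule is a string of the form "type major:minor access" written to devices.allow. A minimal sketch (the 250 major is the stale value from this incident, and the container path is illustrative):

```shell
# Build a cgroup-v1 devices-controller allow rule: "type major:minor access".
# "c" = character device; "rwm" = read, write, mknod.
# 250 is the stale hardcoded major from this incident; nvidia-uvm's minor is 0.
major=250
minor=0
rule="c ${major}:${minor} rwm"
echo "$rule"
# Applying it would require root and the task's real cgroup path, e.g.:
#   echo "$rule" > /sys/fs/cgroup/devices/mesos/<container-id>/devices.allow
```

Checking the container's devices.list while bisecting like this shows which rules are already in effect.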
Root cause: After rebooting a machine, the nvidia-uvm device is not created automatically. To
create this device, "sudo mknod -m 666 /dev/nvidia-uvm c 250 0" had been added to a startup script.
The problem with this is that nvidia-uvm uses a major device ID in the experimental range.
One consequence of this is that the major device ID can change across boots, which means
the hardcoded value of 250 in the startup script was incorrect. When Mesos started up, it read
the major device ID from /dev/nvidia-uvm, which matched the value given by the startup script.
Then, when it created the devices cgroup, it used that number instead of the correct one. nvidia-smi
worked because it never accessed the nvidia-uvm device.
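The major the kernel actually assigned can be read from /proc/devices instead of being hardcoded. A sketch of the lookup, demonstrated on a sample /proc/devices snippet (509 is a made-up dynamically assigned major, not a real value from this incident):

```shell
# Sketch: find the major the kernel actually assigned to nvidia-uvm by
# parsing /proc/devices, rather than trusting a hardcoded 250.
# Demonstrated here on a sample snippet so it runs anywhere.
sample='Character devices:
  1 mem
195 nvidia
509 nvidia-uvm'
uvm_major=$(printf '%s\n' "$sample" | awk '$2 == "nvidia-uvm" {print $1}')
echo "$uvm_major"
# On a real GPU box you would read the live file instead:
#   awk '$2 == "nvidia-uvm" {print $1}' /proc/devices
```

Comparing this value against the major of the existing /dev/nvidia-uvm node would have flagged the mismatch immediately.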
The fix: Do not hardcode the major device ID of nvidia-uvm in a startup script. Instead, bring
the device up with: nvidia-modprobe -u -c 0
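As a boot-time replacement for the mknod line, one possible sketch (assuming the NVIDIA driver package provides nvidia-modprobe; guarded so it is a harmless no-op on machines without it):

```shell
# Sketch of a boot-time fix: let nvidia-modprobe load the nvidia-uvm module
# and create /dev/nvidia-uvm with the kernel-assigned major, instead of
# mknod with a hardcoded 250. Guarded for machines without the driver.
create_uvm_node() {
    if command -v nvidia-modprobe >/dev/null 2>&1; then
        nvidia-modprobe -u -c 0
    else
        echo "nvidia-modprobe not available; skipping"
    fi
}
create_uvm_node
```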

I hope this information helps someone. A big thanks to Kevin Klues for helping me debug
this issue.