Mailing-List: contact dev-help@mesos.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@mesos.apache.org
Message-ID: <BLU168-W420215C3574194903B93BCCCFF0@phx.gbl>
Content-Type: multipart/alternative;
	boundary="_344acffb-d313-46f8-b172-d9954a3e01a7_"
From: Justin Pinkul <jpinkul@live.com>
To: "user@mesos.apache.org" <user@mesos.apache.org>, "dev@mesos.apache.org"
	<dev@mesos.apache.org>
Subject: GPU containerizer post mortem
Date: Mon, 12 Sep 2016 23:48:56 +0000
Importance: Normal
MIME-Version: 1.0
archived-at: Mon, 12 Sep 2016 23:49:10 -0000

--_344acffb-d313-46f8-b172-d9954a3e01a7_
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

Hi everyone,
I just ran into a very subtle configuration problem while enabling GPU
support on Mesos and thought I'd share a brief post mortem.
Scenario: Running a GPU Mesos task. The task first executes nvidia-smi to
confirm that the GPUs are visible and then runs a Caffe training example
to verify that the GPU is usable.
Symptom: nvidia-smi reported the correct number of GPUs, but the training
example crashed when creating the CUDA device.
Debugging tactics: To debug this I added an infinite loop to the end of
the task so the environment would not be torn down. Next I logged into the
machine, found the PID of the Mesos task and entered its namespaces with:
    nsenter -t $TASK_PID -m -u -i -n -p -r -w

At this point I ran the test manually and it worked. It worked because my
test terminal had not been added to the devices cgroup, so the cgroup's
device restrictions did not apply to it. So next I added it to the cgroup
with:
    echo $TEST_TERMINAL_PID >> /sys/fs/cgroup/devices/mesos/f6736041-d403-4494-95fd-604eace34ce1/tasks
After joining the cgroup I could reproduce the problem, and I added
devices to the cgroup's allow list one at a time until it worked.
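
For anyone repeating the allow-list step, here is a minimal sketch (not
part of the original debugging session) of a helper that checks whether a
cgroup v1 devices.list file already has an entry covering a given device.
The function name cgroup_lists_device is hypothetical, and the sketch
ignores the access flags (r, w, m) at the end of each entry.

```shell
#!/bin/sh
# Succeed if a devices.list file has an entry covering TYPE MAJOR:MINOR.
# Entries look like "c 195:* rwm" or "a *:* rwm" (cgroup v1 devices
# controller). Access flags are not checked in this sketch.
# $1: path to a devices.list file, $2: c or b, $3: major, $4: minor
cgroup_lists_device() {
    awk -v t="$2" -v maj="$3" -v min="$4" '
        {
            split($2, id, ":")
            type_ok  = ($1 == "a" || $1 == t)
            major_ok = (id[1] == "*" || id[1] == maj)
            minor_ok = (id[2] == "*" || id[2] == min)
            if (type_ok && major_ok && minor_ok) found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "$1"
}

# Example use (CGROUP_DIR is a placeholder for the task's cgroup path):
#   cgroup_lists_device "$CGROUP_DIR/devices.list" c 250 0 \
#       || echo 'c 250:0 rwm' > "$CGROUP_DIR/devices.allow"
```

Writing an entry to devices.allow in the cgroup directory is how a device
gets added to the allow list by hand; the major and minor in the example
are only illustrative.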
Root cause: After a reboot the nvidia-uvm device is not created
automatically. To create it, "sudo mknod -m 666 /dev/nvidia-uvm c 250 0"
had been added to a startup script. The problem is that nvidia-uvm uses a
major device ID in the experimental range, and one consequence is that the
major device ID might change on boot. That makes the hardcoded value of
250 in the startup script incorrect. When Mesos starts up it reads the
major device ID from /dev/nvidia-uvm, which matches the value given by the
startup script rather than the one the driver actually registered. When
Mesos then creates the devices cgroup it uses that number instead of the
correct one. nvidia-smi worked because it never accesses the nvidia-uvm
device.
The fix: Do not hardcode the major device ID of nvidia-uvm in a startup
script. Instead bring the device up with:
    nvidia-modprobe -u -c 0

I hope this information helps someone, and a big thanks to Kevin Klues for
helping me debug this issue.
Justin

--_344acffb-d313-46f8-b172-d9954a3e01a7_--