From: Justin Pinkul
To: "user@mesos.apache.org", "dev@mesos.apache.org"
Subject: GPU containerizer post mortem
Date: Mon, 12 Sep 2016 23:48:56 +0000

Hi everyone,

I just ran into a very subtle configuration problem with enabling GPU support on Mesos and thought I'd share a brief post mortem.

Scenario: Running a GPU Mesos task. This task first executes nvidia-smi to confirm the GPUs are visible and then executes a Caffe training example to verify the GPU is usable.

Symptom: The nvidia-smi reported the correct number of GPUs but the training example crashed when creating the CUDA device.

Debugging tactics: To debug this I added an infinite loop to the end of the task so the environment would not be torn down. Next I logged into the machine, found the PID of the Mesos task and entered the namespace with:

    nsenter -t $TASK_PID -m -u -i -n -p -r -w

At this point I attempted to manually run the test and it worked. The reason it worked was that my test terminal was not added to the devices CGROUP.
So next I added it to the CGROUP with:

    echo $TEST_TERMINAL_PID >> /sys/fs/cgroup/devices/mesos/f6736041-d403-4494-95fd-604eace34ce1/tasks

After joining the CGROUP I could reproduce the problem and systematically added devices to the CGROUP's allow list until it worked.

Root cause: After rebooting a machine the nvidia-uvm device is not created automatically. To create this device "sudo mknod -m 666 /dev/nvidia-uvm c 250 0" was added to a start up script. The problem with this is that nvidia-uvm uses a major device ID in the experimental range. One of the consequences of this is that the major device ID might change on boot. This means the hardcoded value of 250 in the start up script can be wrong, as it was here. When Mesos starts up it reads the major device ID from /dev/nvidia-uvm, which matches the value given by the start up script. Then when it creates the devices CGROUP it uses that number instead of the correct one. nvidia-smi worked because it never accessed the nvidia-uvm device.

The fix: Do not hard code the major device ID of nvidia-uvm in a start up script. Instead bring the device up with:

    nvidia-modprobe -u -c 0

I hope this information helps someone and a big thanks to Kevin Klues for helping me debug this issue.

Justin
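
The mismatch described under the root cause can be checked for directly. What follows is only a rough sketch, not something from the original debugging session: it assumes a Linux machine with GNU stat, and that the driver shows up under the name nvidia-uvm in /proc/devices. The function names (registered_major, node_major, check_node) are made up for this example.

```shell
#!/bin/sh
# Sketch: compare the major number stored in a device node with the one the
# kernel currently has registered for the driver.
# Assumptions: Linux, GNU stat, /proc/devices available. Function names are
# hypothetical and exist only for this example.

# Print the major number registered for driver $1 in a /proc/devices-style
# file $2 (defaults to /proc/devices). Prints nothing if the driver is absent.
registered_major() {
    awk -v name="$1" '$2 == name { print $1; exit }' "${2:-/proc/devices}"
}

# Print the major number (decimal) stored in device node $1.
# GNU stat's %t format prints the major device number in hex.
node_major() {
    printf '%d\n' "0x$(stat -c '%t' "$1")"
}

# Return 0 if node $2 carries the major number registered for driver $1,
# 1 on a mismatch, 2 if the driver or the node is missing.
# $3 optionally overrides the /proc/devices path.
check_node() {
    want=$(registered_major "$1" "${3:-/proc/devices}")
    [ -n "$want" ] || { echo "$1: driver not registered" >&2; return 2; }
    [ -e "$2" ] || { echo "$2: no such device node" >&2; return 2; }
    have=$(node_major "$2")
    if [ "$have" != "$want" ]; then
        echo "$2: node has major $have, kernel registered $want" >&2
        return 1
    fi
}
```

Run before starting the Mesos agent, something like `check_node nvidia-uvm /dev/nvidia-uvm` would flag a node left behind by a hardcoded mknod. If it reports a mismatch, the node needs to be recreated, for example with the nvidia-modprobe command above.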