Date: Tue, 18 Oct 2016 16:07:58 +0000 (UTC)
From: "Kevin Klues (JIRA)"
To: issues@mesos.apache.org
Reply-To: dev@mesos.apache.org
Subject: [jira] [Commented] (MESOS-6383) NvidiaGpuAllocator::resources cannot load symbol nvmlGetDeviceMinorNumber - can the device minor number be ascertained reliably using an older set of API calls?

[ https://issues.apache.org/jira/browse/MESOS-6383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15585846#comment-15585846 ]

Kevin Klues commented on MESOS-6383:
------------------------------------

Hi Dylan,

Thanks for reporting this. I see the problem, but it's not immediately clear to me what the solution would be. We don't want to just parse the output of {{nvidia-smi}}, because that output also changes from version to version (we talked with Nvidia directly about this, and they *highly* discouraged relying on the output of {{nvidia-smi}}).

One thing I could imagine doing is changing the code that loads the {{nvmlDeviceGetMinorNumber}} symbol from NVML. It could attempt to load the symbol and, if that fails, fall back to implementing our wrapper function {{nvml::deviceGetMinorNumber()}} using a different method (meaning there would be no changes to {{NvidiaGpuAllocator}}).

Do you know what (if any) methods were available in the 5.319 driver to determine the minor number? How does the old {{nvidia-smi}} determine them?

Also, are you sure this is the only symbol we aren't able to load from the old driver, or did you just hit this one first?

> NvidiaGpuAllocator::resources cannot load symbol nvmlGetDeviceMinorNumber - can the device minor number be ascertained reliably using an older set of API calls?
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: MESOS-6383
> URL: https://issues.apache.org/jira/browse/MESOS-6383
> Project: Mesos
> Issue Type: Improvement
> Affects Versions: 1.0.1
> Reporter: Dylan Bethune-Waddell
> Priority: Minor
> Labels: gpu
>
> We're attempting to deploy Mesos on a cluster with 2 Nvidia GPUs per host. We are not in a position to upgrade the Nvidia drivers in the near future, and are currently at driver version 319.72.
> When attempting to launch an agent with the following command and take advantage of Nvidia GPU support (master address elided):
> bq. {{./bin/mesos-agent.sh --master=: --work_dir=/tmp/mesos --isolation="cgroups/devices,gpu/nvidia"}}
> I receive the following error message:
> bq. {{Failed to create a containerizer: Failed call to NvidiaGpuAllocator::resources: Failed to nvml::initialize: Failed to load symbol 'nvmlDeviceGetMinorNumber': Error looking up symbol 'nvmlDeviceGetMinorNumber' in 'libnvidia-ml.so.1' : /usr/lib64/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetMinorNumber}}
> Based on the change log for the NVML module, it seems that {{nvmlDeviceGetMinorNumber}} is only available for driver versions 331 and later, as per the info under the [Changes between NVML v5.319 Update and v331|http://docs.nvidia.com/deploy/nvml-api/change-log.html#change-log] heading in the NVML API reference.
> Is there an alternate method of obtaining this information at runtime to enable support for older versions of the Nvidia driver? Based on discussion in the design document, obtaining this information from the {{nvidia-smi}} command output is a feasible alternative.
> I am willing to submit a PR that amends the behaviour of {{NvidiaGpuAllocator}} such that it first attempts calls to {{nvml::nvmlGetDeviceMinorNumber}} via libnvidia-ml and, if the symbol cannot be found, falls back on the path given by a {{--nvidia-smi="/path/to/nvidia-smi"}} option to mesos-agent, if provided, or otherwise runs {{nvidia-smi}} if it is found on the PATH, and parses the output to obtain this information. If none of that succeeds, it would raise an exception indicating that all of this was attempted.
> Would a function or class for parsing {{nvidia-smi}} output be a useful contribution?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
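
For illustration only, the load-then-fall-back idea from the comment above (try to resolve {{nvmlDeviceGetMinorNumber}}, and use a different method only if the symbol is missing) could be sketched roughly as below. This is a Python/ctypes sketch of the pattern, not Mesos code. The names {{try_load_symbol}} and {{select_minor_number_lookup}} are hypothetical, and the fallback is a caller-supplied placeholder, since the thread does not say what the alternative method on older drivers would be. The device argument is assumed to be an opaque NVML device handle that the caller has already obtained.

```python
import ctypes

# Library and symbol names are taken from the error message in the report.
NVML_LIBRARY = "libnvidia-ml.so.1"
MINOR_NUMBER_SYMBOL = "nvmlDeviceGetMinorNumber"


def try_load_symbol(library, symbol):
    """Return the foreign function `symbol` from `library`, or None if the
    library cannot be loaded or does not export the symbol."""
    try:
        handle = ctypes.CDLL(library)
    except OSError:
        return None  # library missing or not loadable
    try:
        return getattr(handle, symbol)
    except AttributeError:
        return None  # library present, symbol absent (e.g. an older driver)


def select_minor_number_lookup(fallback, library=NVML_LIBRARY):
    """Return (lookup, source). `lookup(device)` yields the minor number.

    `fallback` is a hypothetical callable standing in for whatever
    alternative method is eventually chosen; it is used only when the
    NVML symbol cannot be resolved.
    """
    native = try_load_symbol(library, MINOR_NUMBER_SYMBOL)
    if native is None:
        return fallback, "fallback"

    # NVML signature: nvmlReturn_t f(nvmlDevice_t device, unsigned int* minor)
    native.argtypes = [ctypes.c_void_p, ctypes.POINTER(ctypes.c_uint)]
    native.restype = ctypes.c_int

    def lookup(device):
        minor = ctypes.c_uint()
        status = native(device, ctypes.byref(minor))
        if status != 0:  # NVML_SUCCESS is 0
            raise RuntimeError("nvmlDeviceGetMinorNumber failed: %d" % status)
        return minor.value

    return lookup, "nvml"
```

Because the choice is made once, when the symbol is resolved, callers see a single lookup function either way, which matches the point that {{NvidiaGpuAllocator}} itself would not need to change.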