mesos-issues mailing list archives

From "Dylan Bethune-Waddell (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MESOS-6383) NvidiaGpuAllocator::resources cannot load symbol nvmlDeviceGetMinorNumber - can the device minor number be ascertained reliably using an older set of API calls?
Date Sat, 15 Oct 2016 23:39:20 GMT

     [ https://issues.apache.org/jira/browse/MESOS-6383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dylan Bethune-Waddell updated MESOS-6383:
-----------------------------------------
    Description: 
We're attempting to deploy Mesos on a cluster with 2 Nvidia GPUs per host. We are not in a
position to upgrade the Nvidia drivers in the near future, and are currently at driver version
319.72.

When launching an agent with the following command to take advantage of Nvidia GPU support
(master address elided):

bq. {{./bin/mesos-agent.sh --master=<masterIP>:<masterPort> --work_dir=/tmp/mesos
--isolation="cgroups/devices,gpu/nvidia"}}

I receive the following error message:

bq. {{Failed to create a containerizer: Failed call to NvidiaGpuAllocator::resources: Failed
to nvml::initialize: Failed to load symbol 'nvmlDeviceGetMinorNumber': Error looking up symbol
'nvmlDeviceGetMinorNumber' in 'libnvidia-ml.so.1' : /usr/lib64/libnvidia-ml.so.1: undefined
symbol: nvmlDeviceGetMinorNumber}}

Based on the change log for the NVML module, it seems that {{nvmlDeviceGetMinorNumber}} is
only available for driver versions 331 and later, as noted under the [Changes between NVML
v5.319 Update and v331|http://docs.nvidia.com/deploy/nvml-api/change-log.html#change-log]
heading in the NVML API reference.
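
This is easy to confirm on a given host with a small standalone probe, independent of Mesos. A minimal sketch using only {{dlopen}}/{{dlsym}}:

{code}
// probe.cpp -- check whether the installed libnvidia-ml.so.1 exports
// nvmlDeviceGetMinorNumber. Build with: g++ probe.cpp -o probe -ldl
#include <dlfcn.h>
#include <cstdio>

int main() {
  void* handle = dlopen("libnvidia-ml.so.1", RTLD_LAZY);
  if (handle == nullptr) {
    fprintf(stderr, "dlopen failed: %s\n", dlerror());
    return 1;
  }

  dlerror();  // Clear any stale error state, per dlsym(3).
  void* symbol = dlsym(handle, "nvmlDeviceGetMinorNumber");

  printf("nvmlDeviceGetMinorNumber: %s\n",
         symbol != nullptr ? "available (driver >= 331)" : "missing");

  dlclose(handle);
  return 0;
}
{code}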

Is there an alternate method of obtaining this information at runtime to enable support
for older versions of the Nvidia driver? Based on discussion in the design document, obtaining
this information from the {{nvidia-smi}} command output is a feasible alternative.

I am willing to submit a PR that amends the behaviour of {{NvidiaGpuAllocator}} so that
it first attempts to call {{nvmlDeviceGetMinorNumber}} via libnvidia-ml and, if the symbol
cannot be found, falls back on the {{--nvidia-smi="/path/to/nvidia-smi"}} option passed to
mesos-agent if provided, or else attempts to run {{nvidia-smi}} if it is found on the PATH,
parsing the output to obtain this information. Otherwise, it raises an exception indicating
that all of this was attempted. A sketch of this fallback order follows.
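
For concreteness, a minimal sketch of that fallback order. All names here are hypothetical, not existing Mesos APIs; the {{minorNumbersFromSmi()}} helper is sketched after the next paragraph:

{code}
// Hypothetical sketch of the proposed fallback order -- none of these
// names exist in Mesos today. Returns one minor number per GPU.
#include <dlfcn.h>

#include <stdexcept>
#include <string>
#include <vector>

// Existing NVML code path (drivers >= 331), elided here.
std::vector<unsigned int> minorNumbersFromNvml();

// Parse `<smiPath> -q` output; sketched after the next paragraph.
std::vector<unsigned int> minorNumbersFromSmi(const std::string& smiPath);

std::vector<unsigned int> deviceMinorNumbers(const std::string& smiFlag) {
  // 1. Preferred: the NVML call, if the installed driver exports it.
  void* handle = dlopen("libnvidia-ml.so.1", RTLD_LAZY);
  if (handle != nullptr &&
      dlsym(handle, "nvmlDeviceGetMinorNumber") != nullptr) {
    return minorNumbersFromNvml();
  }

  // 2. Next: the nvidia-smi binary named by the proposed --nvidia-smi
  //    agent flag, if the operator provided one; otherwise whatever
  //    `nvidia-smi` resolves to on the PATH.
  const std::string smiPath = !smiFlag.empty() ? smiFlag : "nvidia-smi";
  std::vector<unsigned int> minors = minorNumbersFromSmi(smiPath);

  // 3. Otherwise, fail loudly, noting everything that was attempted.
  if (minors.empty()) {
    throw std::runtime_error(
        "Driver lacks nvmlDeviceGetMinorNumber and '" + smiPath +
        "' produced no parseable minor numbers");
  }

  return minors;
}
{code}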

Would a function or class for parsing {{nvidia-smi}} output be a useful contribution?
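
To make the question concrete, here is a minimal sketch of such a parser. It assumes {{nvidia-smi -q}} on a 319-era driver prints a per-GPU "Minor Number" field, which is worth verifying on an actual 319.72 host before committing to this approach:

{code}
// Hypothetical parser sketch: run `<smiPath> -q` and collect each
// per-GPU "Minor Number : N" field in the order it appears. Assumes
// (unverified on 319.72) that this driver's nvidia-smi prints the field.
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

std::vector<unsigned int> minorNumbersFromSmi(const std::string& smiPath) {
  std::vector<unsigned int> minors;

  const std::string command = smiPath + " -q";
  FILE* pipe = popen(command.c_str(), "r");
  if (pipe == nullptr) {
    return minors;  // Caller treats an empty result as "unavailable".
  }

  char line[512];
  while (fgets(line, sizeof(line), pipe) != nullptr) {
    // Lines of interest look like: "    Minor Number : 0"
    const char* field = strstr(line, "Minor Number");
    if (field == nullptr) {
      continue;
    }
    const char* colon = strchr(field, ':');
    if (colon != nullptr) {
      minors.push_back(static_cast<unsigned int>(atoi(colon + 1)));
    }
  }

  pclose(pipe);
  return minors;
}
{code}

If the field turns out to be absent from 319-era {{nvidia-smi}} output, note that the NVML documentation defines the minor number such that the device node for each GPU has the form /dev/nvidia[minor number], so enumerating the /dev/nvidia* entries could serve as a further fallback.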

  was:
We're attempting to deploy Mesos on a cluster with 2 Nvidia GPUs per host. We are not in a
position to upgrade the Nvidia drivers in the near future, and are currently at driver version
319.72.

When launching an agent with the following command to take advantage of Nvidia GPU support
(master address elided):

bq. {{./bin/mesos-agent.sh --master=<masterIP>:<masterPort> --work_dir=/tmp/mesos
--isolation="cgroups/devices,gpu/nvidia"}}

I receive the following error message:

bq. {{Failed to create a containerizer: Failed call to NvidiaGpuAllocator::resources: Failed
to nvml::initialize: Failed to load symbol 'nvmlDeviceGetMinorNumber': Error looking up symbol
'nvmlDeviceGetMinorNumber' in 'libnvidia-ml.so.1' : /usr/lib64/libnvidia-ml.so.1: undefined
symbol: nvmlDeviceGetMinorNumber}}

Based on the change log for the NVML module, it seems that {{nvmlDeviceGetMinorNumber}} is
only available for driver versions 331 and later, as noted under the [Changes between NVML
v5.319 Update and v331|http://docs.nvidia.com/deploy/nvml-api/change-log.html#change-log]
heading in the NVML API reference.

Is there an alternate method of obtaining this information at runtime to enable support
for older versions of the Nvidia driver? A modest search has not yet yielded much insight
on a path forward.


> NvidiaGpuAllocator::resources cannot load symbol nvmlDeviceGetMinorNumber - can the device
minor number be ascertained reliably using an older set of API calls?
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-6383
>                 URL: https://issues.apache.org/jira/browse/MESOS-6383
>             Project: Mesos
>          Issue Type: Improvement
>    Affects Versions: 1.0.1
>            Reporter: Dylan Bethune-Waddell
>            Priority: Minor
>              Labels: gpu
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
