hadoop-yarn-dev mailing list archives

From "Wangda Tan (JIRA)" <j...@apache.org>
Subject [jira] [Created] (YARN-6223) [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation on YARN
Date Thu, 23 Feb 2017 00:47:44 GMT
Wangda Tan created YARN-6223:
--------------------------------

             Summary: [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation on YARN
                 Key: YARN-6223
                 URL: https://issues.apache.org/jira/browse/YARN-6223
             Project: Hadoop YARN
          Issue Type: New Feature
            Reporter: Wangda Tan
            Assignee: Wangda Tan


A growing variety of workloads is moving to YARN, including machine learning and deep learning,
which can be sped up by leveraging GPU computation power. Workloads should be able to request
GPUs from YARN as simply as they request CPU and memory.

To tell a complete GPU story, we should support the following pieces:
1) GPU discovery/configuration: An admin can either configure GPU resources and architectures
on each node or, in the more advanced case, the NodeManager can automatically discover GPU
resources and architectures and report them to the ResourceManager.

2) GPU scheduling: The YARN scheduler should account for GPU as a resource type, just like
CPU and memory.

3) GPU isolation/monitoring: Once a task is launched with GPU resources, the NodeManager should
properly isolate and monitor the task's resource usage.
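
As a rough illustration of the automatic discovery in piece 1, the sketch below lists GPUs on a
node by calling nvidia-smi and parsing its CSV output. This is not NodeManager code and not part
of any proposal in this JIRA; the function names parse_gpu_csv and discover_gpus are hypothetical,
and the sketch assumes NVIDIA GPUs with nvidia-smi on the PATH.

```python
import csv
import io
import subprocess

# Assumption: NVIDIA driver tools are installed; this asks for one CSV row per GPU.
QUERY = ["nvidia-smi", "--query-gpu=index,name", "--format=csv,noheader"]


def parse_gpu_csv(text):
    """Parse 'index, name' CSV rows into a list of dicts (hypothetical helper)."""
    gpus = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        gpus.append({"index": int(row[0]), "name": row[1].strip()})
    return gpus


def discover_gpus():
    """Return the GPUs visible on this node, or an empty list if none can be queried."""
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return []  # nvidia-smi missing or failed: report no GPUs
    return parse_gpu_csv(result.stdout)
```

A node agent could report the length of this list as its GPU capacity; falling back to an empty
list keeps nodes without GPUs working unchanged.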

For #2, YARN-3926 can support it natively. For #3, YARN-3611 has introduced an extensible
framework to support isolation for different resource types and different runtimes.
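
To make the scheduling idea in #2 concrete, here is a minimal sketch of accounting for GPU as
one more named entry in a resource vector, rather than as a special-cased field. It is not
YARN's API or the YARN-3926 implementation; the functions fits and allocate and the resource
names "memory-mb", "vcores" and "gpu" are assumptions chosen for illustration.

```python
def fits(request, available):
    """True if every requested resource type is available in sufficient quantity."""
    return all(available.get(name, 0) >= amount for name, amount in request.items())


def allocate(request, available):
    """Return the node's remaining resources after the allocation, or None if it does not fit."""
    if not fits(request, available):
        return None
    remaining = dict(available)  # copy so the caller's view is not mutated
    for name, amount in request.items():
        remaining[name] -= amount
    return remaining


# Hypothetical node capacity and container request; GPU is just another key.
node = {"memory-mb": 8192, "vcores": 8, "gpu": 2}
container = {"memory-mb": 4096, "vcores": 2, "gpu": 1}
remaining = allocate(container, node)
```

Because the scheduler logic only iterates over names, adding a new resource type needs no change
to the accounting code, which is the appeal of a generic resource-type mechanism.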

A couple of JIRAs (YARN-4122/YARN-5517) have been filed with similar goals but different solutions:

For scheduling:
- YARN-4122/YARN-5517 both add a new GPU resource type to the Resource protocol instead of
leveraging YARN-3926.

For isolation:
- YARN-4122 proposes using CGroups for isolation, which cannot solve the problems listed
at https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation#challenges, such as minor device
number mapping, loading the nvidia_uvm module, and mismatched CUDA/driver versions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-dev-help@hadoop.apache.org

