hadoop-yarn-issues mailing list archives

From "Zhankun Tang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-6620) [YARN-6223] NM Java side code changes to support isolate GPU devices by using CGroups
Date Mon, 18 Sep 2017 06:50:02 GMT

    [ https://issues.apache.org/jira/browse/YARN-6620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16169655#comment-16169655 ]

Zhankun Tang commented on YARN-6620:

Good point, I think we should use node attributes to distinguish them. I think this might be
unavoidable: different DL workloads need different driver versions / GPU architectures and
different frameworks like OpenCL/CUDA, so we need node attributes anyway.
[~wangda], yeah. Node attributes are a must.
Another thing just came to mind: do we need to support one physical machine with GPU cards
from two different vendors? If this scenario is a real requirement, we may need to extend the
resource handler to manage several different plugins (I've done this in the prior FPGA patch), as below:
1. In the "bootstrap" method, each GPU vendor's plugin registers with a single GPU resource handler
under the resource name it can handle. For instance, plugin A registers a resource "A-GPU"
and plugin B registers "B-GPU". The GPU resource handler holds records of <resourceName, pluginInstance>.
2. When "preStart" is invoked, it retrieves the ResourceInformation array from container.getResource().getResources()
to find the proper GPU vendor plugin and performs the plugin callback (or perhaps no callback is needed
for GPU; it seems to be needed for FPGA). It then uses the GPU allocator to allocate the requested count
of this specific type of GPU in a round-robin manner, and then does the cgroups isolation.
3. Now back to the AM: it can request a container with one resource named "A-GPU"
in the containerRequest and the node attribute "CUDA v1" at the same time.
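
To make steps 1 and 2 more concrete, here is a minimal, self-contained Java sketch of the
registration and round-robin allocation idea. It is only an illustration under stated
assumptions, not the code from any YARN-6620 patch: GpuVendorPlugin, FixedPlugin,
GpuResourceHandler and their methods are hypothetical names, the container's requested
resources are modeled as a plain Map<String, Integer> instead of the ResourceInformation
array, and the plugin callback, device release and the cgroups step are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GpuPluginRegistrySketch {

    /** Hypothetical vendor plugin contract; not an existing YARN interface. */
    interface GpuVendorPlugin {
        String resourceName();            // e.g. "A-GPU"
        List<Integer> discoverDevices();  // device minor numbers found on this node
    }

    /** Trivial plugin reporting a fixed device list (for illustration only). */
    static class FixedPlugin implements GpuVendorPlugin {
        private final String name;
        private final List<Integer> minors;
        FixedPlugin(String name, List<Integer> minors) {
            this.name = name;
            this.minors = minors;
        }
        public String resourceName() { return name; }
        public List<Integer> discoverDevices() { return minors; }
    }

    /** Hypothetical handler keeping <resourceName, pluginInstance> records. */
    static class GpuResourceHandler {
        private final Map<String, GpuVendorPlugin> plugins = new HashMap<>();
        private final Map<String, List<Integer>> devices = new HashMap<>();
        private final Map<String, Integer> cursor = new HashMap<>();

        // Step 1: each vendor plugin registers under the resource name it handles.
        void register(GpuVendorPlugin plugin) {
            String name = plugin.resourceName();
            if (plugins.containsKey(name)) {
                throw new IllegalArgumentException("Duplicate resource name: " + name);
            }
            plugins.put(name, plugin);
            devices.put(name, new ArrayList<>(plugin.discoverDevices()));
            cursor.put(name, 0);
        }

        // Step 2: for every requested resource that a plugin handles, pick devices.
        // Resources without a registered plugin (memory, vcores, ...) are skipped.
        Map<String, List<Integer>> preStart(Map<String, Integer> requested) {
            Map<String, List<Integer>> assigned = new LinkedHashMap<>();
            for (Map.Entry<String, Integer> e : requested.entrySet()) {
                if (plugins.containsKey(e.getKey())) {
                    assigned.put(e.getKey(), allocate(e.getKey(), e.getValue()));
                }
            }
            return assigned;
        }

        // Round-robin over one vendor's devices. Simplification: devices in use
        // are not tracked, so a real allocator would also need release handling.
        private List<Integer> allocate(String name, int count) {
            List<Integer> pool = devices.get(name);
            if (count > pool.size()) {
                throw new IllegalStateException("Not enough " + name + " devices");
            }
            List<Integer> picked = new ArrayList<>();
            int start = cursor.get(name);
            for (int i = 0; i < count; i++) {
                picked.add(pool.get((start + i) % pool.size()));
            }
            if (count > 0) {
                cursor.put(name, (start + count) % pool.size());
            }
            return picked;
        }
    }

    public static void main(String[] args) {
        GpuResourceHandler handler = new GpuResourceHandler();
        handler.register(new FixedPlugin("A-GPU", Arrays.asList(0, 1, 2)));
        handler.register(new FixedPlugin("B-GPU", Arrays.asList(0, 1)));

        Map<String, Integer> request = new LinkedHashMap<>();
        request.put("memory-mb", 4096);
        request.put("A-GPU", 2);
        System.out.println(handler.preStart(request));
    }
}
```

Keeping a separate device pool and cursor per resource name means "A-GPU" and "B-GPU" are
never mixed in one allocation, which is the point of registering them as distinct resources.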

I'm not sure whether one host with devices from different vendors is a real requirement. If so,
it may raise another concern about our current design, since we implicitly treat them as the same
resource. Any ideas?

> [YARN-6223] NM Java side code changes to support isolate GPU devices by using CGroups
> -------------------------------------------------------------------------------------
>                 Key: YARN-6620
>                 URL: https://issues.apache.org/jira/browse/YARN-6620
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Wangda Tan
>            Assignee: Wangda Tan
>         Attachments: YARN-6620.001.patch, YARN-6620.002.patch, YARN-6620.003.patch, YARN-6620.004.patch, YARN-6620.005.patch, YARN-6620.006-WIP.patch
> This JIRA plans to add support for:
> 1) GPU configuration for NodeManagers
> 2) Isolation in CGroups (Java side).
> 3) NM restart and recovery of allocated GPU devices

