hadoop-yarn-issues mailing list archives

From "Zhankun Tang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-6620) Add support in NodeManager to isolate GPU devices by using CGroups
Date Wed, 18 Oct 2017 06:24:02 GMT

    [ https://issues.apache.org/jira/browse/YARN-6620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208861#comment-16208861 ]

Zhankun Tang commented on YARN-6620:
------------------------------------

[~wangda], thanks for the clarification. 
The code below, which confused me previously, is clear now:

{code:java}
public static final Map<String, ResourceInformation> MANDATORY_RESOURCES =
      ImmutableMap.of(MEMORY_URI, MEMORY_MB, VCORES_URI, VCORES, GPU_URI, GPUS);
...
private static void checkMandatoryResources(
...
if (!expectedUnit.equals(actualUnit) || !expectedType.equals(
            actualType)) {
  ...
}
...
}
{code}

The above code indicates that "yarn.io/gpu" must be defined by the admin in resource-types.xml
(the type name) and node-resources.xml (the total count), with the exact unit and type that YARN
expects. In addition, the admin-allowed minor device numbers are declared in yarn-site.xml.
Finally, the major and minor device numbers are also declared in the gpu section of
container-executor.cfg (by the root user).
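
As a rough illustration of the check quoted above, here is a minimal, self-contained sketch. It is not Hadoop's actual implementation: `ResourceSpec`, `Kind`, `isValid`, and the unit strings are hypothetical stand-ins for `ResourceInformation` and its fields, assumed only for this example. It shows the idea that an admin-supplied definition of a mandatory resource such as "yarn.io/gpu" is rejected when its unit or type differs from what the framework expects.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class MandatoryResourceCheck {

  // Hypothetical stand-in for a resource's type; not Hadoop's own enum.
  enum Kind { COUNTABLE, OTHER }

  // Hypothetical, simplified stand-in for ResourceInformation.
  static final class ResourceSpec {
    final String units;
    final Kind kind;

    ResourceSpec(String units, Kind kind) {
      this.units = units;
      this.kind = kind;
    }
  }

  // Expected unit and type per mandatory resource name (values assumed here).
  static final Map<String, ResourceSpec> MANDATORY;
  static {
    Map<String, ResourceSpec> m = new LinkedHashMap<>();
    m.put("memory-mb", new ResourceSpec("Mi", Kind.COUNTABLE));
    m.put("vcores", new ResourceSpec("", Kind.COUNTABLE));
    m.put("yarn.io/gpu", new ResourceSpec("", Kind.COUNTABLE));
    MANDATORY = Collections.unmodifiableMap(m);
  }

  // Throws if a configured mandatory resource has an unexpected unit or type.
  static void checkMandatoryResources(Map<String, ResourceSpec> configured) {
    for (Map.Entry<String, ResourceSpec> entry : MANDATORY.entrySet()) {
      ResourceSpec actual = configured.get(entry.getKey());
      if (actual == null) {
        continue; // not redefined by the admin, so there is nothing to compare
      }
      ResourceSpec expected = entry.getValue();
      if (!expected.units.equals(actual.units) || expected.kind != actual.kind) {
        throw new IllegalArgumentException(
            "Mismatched definition for mandatory resource " + entry.getKey());
      }
    }
  }

  // Convenience wrapper that reports the outcome as a boolean.
  static boolean isValid(Map<String, ResourceSpec> configured) {
    try {
      checkMandatoryResources(configured);
      return true;
    } catch (IllegalArgumentException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    Map<String, ResourceSpec> matching = new HashMap<>();
    matching.put("yarn.io/gpu", new ResourceSpec("", Kind.COUNTABLE));
    System.out.println("matching definition accepted: " + isValid(matching));

    Map<String, ResourceSpec> wrongUnit = new HashMap<>();
    wrongUnit.put("yarn.io/gpu", new ResourceSpec("G", Kind.COUNTABLE));
    System.out.println("wrong-unit definition accepted: " + isValid(wrongUnit));
  }
}
```

In this sketch a resource that the admin does not declare at all is simply skipped; only a conflicting declaration fails the check.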

And as we mentioned before, even when they share the same "yarn.io/gpu" resource name, different
vendors' GPUs can be distinguished by node attributes to meet scheduling needs in a heterogeneous
cluster. More broadly, though, if a vendor's device needs a different toolchain for discovery or
flashing (in FPGA cases), the current single resource handler instance might not be enough to
handle all toolchain operations.

Anyway, I'm satisfied with the current design; let's evolve it when we get more cases.


> Add support in NodeManager to isolate GPU devices by using CGroups
> ------------------------------------------------------------------
>
>                 Key: YARN-6620
>                 URL: https://issues.apache.org/jira/browse/YARN-6620
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Wangda Tan
>            Assignee: Wangda Tan
>             Fix For: 3.1.0
>
>         Attachments: YARN-6620.001.patch, YARN-6620.002.patch, YARN-6620.003.patch, YARN-6620.004.patch,
YARN-6620.005.patch, YARN-6620.006-WIP.patch, YARN-6620.007.patch, YARN-6620.008.patch, YARN-6620.009.patch,
YARN-6620.010.patch, YARN-6620.011.patch, YARN-6620.012.patch, YARN-6620.013.patch, YARN-6620.014.patch,
YARN-6620.015.patch, YARN-6620.016.patch, YARN-6620.017.patch
>
>
> This JIRA plans to add support for:
> 1) GPU configuration for NodeManagers
> 2) Isolation in CGroups. (Java side).
> 3) NM restart and recovery of allocated GPU devices



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org

