hadoop-yarn-issues mailing list archives

From "Zhankun Tang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-6620) Add support in NodeManager to isolate GPU devices by using CGroups
Date Wed, 18 Oct 2017 06:24:02 GMT

    [ https://issues.apache.org/jira/browse/YARN-6620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208861#comment-16208861 ]

Zhankun Tang commented on YARN-6620:

[~wangda], thanks for the clarification.
The code below, which confused me previously, is clear now:

public static final Map<String, ResourceInformation> MANDATORY_RESOURCES =
private static void checkMandatoryResources(
if (!expectedUnit.equals(actualUnit) || !expectedType.equals(
            actualType)) {

The above code indicates that "yarn.io/gpu" must be defined by the admin in resource-types.xml
(type name) and node-resources.xml (total count), exactly matching what YARN expects. In
addition, the admin-allowed minor device numbers are declared in yarn-site.xml. Finally, the
major and minor device numbers are also declared in the gpu section of container-executor.cfg (by root
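
For readers without the patch at hand, the kind of validation those fragments perform can be sketched roughly as follows. This is only an illustration, not the actual Hadoop code: the class name MandatoryResourceCheck, the ResourceInfo holder, and the expected unit ("") and type ("COUNTABLE") for "yarn.io/gpu" are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; names and expected values are assumptions,
// not the real org.apache.hadoop.yarn classes.
class MandatoryResourceCheck {

    /** Hypothetical stand-in for ResourceInformation: just a unit and a type. */
    static final class ResourceInfo {
        final String unit;
        final String type;

        ResourceInfo(String unit, String type) {
            this.unit = unit;
            this.type = type;
        }
    }

    // Resources whose unit and type must match what the framework expects.
    // The values for "yarn.io/gpu" are assumed for this example.
    static final Map<String, ResourceInfo> MANDATORY_RESOURCES = new HashMap<>();
    static {
        MANDATORY_RESOURCES.put("yarn.io/gpu", new ResourceInfo("", "COUNTABLE"));
    }

    /**
     * Throws IllegalStateException if a mandatory resource is configured
     * with a unit or type that differs from the expected one.
     * Mandatory resources that are not configured at all are skipped.
     */
    static void checkMandatoryResources(Map<String, ResourceInfo> configured) {
        for (Map.Entry<String, ResourceInfo> entry : MANDATORY_RESOURCES.entrySet()) {
            ResourceInfo actual = configured.get(entry.getKey());
            if (actual == null) {
                continue; // resource not configured, nothing to validate
            }
            String expectedUnit = entry.getValue().unit;
            String expectedType = entry.getValue().type;
            String actualUnit = actual.unit;
            String actualType = actual.type;
            if (!expectedUnit.equals(actualUnit) || !expectedType.equals(
                    actualType)) {
                throw new IllegalStateException("Resource '" + entry.getKey()
                    + "' must have unit '" + expectedUnit + "' and type '"
                    + expectedType + "', but was configured with unit '"
                    + actualUnit + "' and type '" + actualType + "'");
            }
        }
    }
}
```

With this sketch, a configuration that defines "yarn.io/gpu" with a unit such as "G" would be rejected at startup, which is why the admin's definition has to match the expectation exactly.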

And as we mentioned before, even under the same "yarn.io/gpu" name, GPUs from different vendors
can be distinguished by node attributes to meet scheduling needs in a heterogeneous cluster. But
more broadly, if a vendor's device needs a different toolchain for discovery or flashing (as in
FPGA cases), the current single resource handler instance might not be enough to handle all toolchain

Anyway, I'm satisfied with the current design and let's evolve it when we get more cases.

> Add support in NodeManager to isolate GPU devices by using CGroups
> ------------------------------------------------------------------
>                 Key: YARN-6620
>                 URL: https://issues.apache.org/jira/browse/YARN-6620
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Wangda Tan
>            Assignee: Wangda Tan
>             Fix For: 3.1.0
>         Attachments: YARN-6620.001.patch, YARN-6620.002.patch, YARN-6620.003.patch, YARN-6620.004.patch,
> YARN-6620.005.patch, YARN-6620.006-WIP.patch, YARN-6620.007.patch, YARN-6620.008.patch, YARN-6620.009.patch,
> YARN-6620.010.patch, YARN-6620.011.patch, YARN-6620.012.patch, YARN-6620.013.patch, YARN-6620.014.patch,
> YARN-6620.015.patch, YARN-6620.016.patch, YARN-6620.017.patch
> This JIRA plans to add support for:
> 1) GPU configuration for NodeManagers
> 2) Isolation in CGroups (Java side).
> 3) NM restart and recovery of allocated GPU devices
