hadoop-yarn-issues mailing list archives

From "Wangda Tan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-7481) Gpu locality support for Better AI scheduling
Date Tue, 14 Nov 2017 04:25:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250823#comment-16250823 ]

Wangda Tan commented on YARN-7481:

Thanks [~qinchen@microsoft.com]/Myeongjae for filing the JIRA and uploading design docs. 

As discussed offline, although I think the ideal case is to let YARN understand the GPU hierarchy
and assign GPUs with the best interconnect, I'd still like to move this effort forward, since
it enables some GPU use cases and I can see that the common functionality of local resource
affinity can be reused by other features as well (for example network/SSDs/CPU, etc.).

Regarding the design / prototype:

The proposed approach adds one more field to the Resource class. I'm not sure what the
best solution for this is. We could probably add a new type to {{org.apache.hadoop.yarn.api.protocolrecords.ResourceTypes}}
(such as vector/bitmap, etc.), and a class inheriting from ResourceInformation to return the vector/bitmap.
Bitmap/vector types would be excluded from scheduling / DRF calculation, etc.
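
To make the suggestion above concrete, here is a minimal sketch of a bitmap-typed resource that a DRF-style dominant-share calculation skips. Everything in it is an assumption for illustration: {{SketchResourceType}}, {{SketchResourceInfo}}, {{BitmapResourceInfo}}, {{DominantShareSketch}} and the resource name "gpu-locality" are hypothetical stand-ins, not the actual YARN classes or any agreed design.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a resource type enum; not the YARN ResourceTypes class. */
enum SketchResourceType { COUNTABLE, BITMAP }

/** Hypothetical stand-in for a per-resource record (name, type, value). */
class SketchResourceInfo {
    final String name;
    final SketchResourceType type;
    final long value;

    SketchResourceInfo(String name, SketchResourceType type, long value) {
        this.name = name;
        this.type = type;
        this.value = value;
    }
}

/** Inherited record that exposes its value as a bitmap instead of a count. */
class BitmapResourceInfo extends SketchResourceInfo {
    BitmapResourceInfo(String name, long bitmap) {
        super(name, SketchResourceType.BITMAP, bitmap);
    }

    long getBitmap() {
        return value;
    }
}

final class DominantShareSketch {
    /** Largest used/capacity ratio over countable resources; bitmap entries are skipped. */
    static double dominantShare(List<SketchResourceInfo> used, Map<String, Long> capacity) {
        double max = 0.0;
        for (SketchResourceInfo r : used) {
            if (r.type != SketchResourceType.COUNTABLE) {
                continue; // a bitmap carries placement information, not a quantity
            }
            Long cap = capacity.get(r.name);
            if (cap == null || cap <= 0) {
                continue; // no known capacity for this resource; ignore it
            }
            max = Math.max(max, (double) r.value / cap);
        }
        return max;
    }

    public static void main(String[] args) {
        List<SketchResourceInfo> used = List.of(
            new SketchResourceInfo("memory-mb", SketchResourceType.COUNTABLE, 4096),
            new SketchResourceInfo("vcores", SketchResourceType.COUNTABLE, 2),
            new BitmapResourceInfo("gpu-locality", 0b11L));
        Map<String, Long> capacity = Map.of("memory-mb", 16384L, "vcores", 4L);
        System.out.println(dominantShare(used, capacity)); // prints 0.5
    }
}
```

The point of the sketch is only that the type tag lets the calculator ignore bitmap/vector values without special-casing resource names.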

+ folks who might be interested in this effort:


> Gpu locality support for Better AI scheduling
> ---------------------------------------------
>                 Key: YARN-7481
>                 URL: https://issues.apache.org/jira/browse/YARN-7481
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: api, RM, yarn
>    Affects Versions: 2.7.2
>            Reporter: Chen Qingcha
>             Fix For: 2.7.2
>         Attachments: GPU locality support for Job scheduling.pdf, hadoop-2.7.2-gpu.patch
>   Original Estimate: 1,344h
>  Remaining Estimate: 1,344h
> We enhance Hadoop with GPU support for better AI job scheduling. 
> Currently, YARN-3926 also supports GPU scheduling, which treats GPU as a countable resource.
> However, GPU placement is also very important to deep learning jobs for better efficiency.
>  For example, a 2-GPU job running on GPUs {0, 1} could be faster than one running on GPUs {0, 7}, if
> GPU 0 and 1 are under the same PCI-E switch while 0 and 7 are not.
>  We add support to Hadoop 2.7.2 to enable GPU locality scheduling, which supports
> fine-grained GPU placement. 
> A 64-bit bitmap is added to the YARN Resource, which indicates both GPU usage and locality
> information in a node (up to 64 GPUs per node). A '1' in a bit position means the
> corresponding GPU is available, and '0' means it is not.
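
The bitmap semantics in the issue description above can be sketched roughly as follows. This is not taken from the attached patch: {{GpuBitmapSketch}} and its methods are hypothetical, and the switch grouping in {{main}} (GPUs 0-3 under one PCI-E switch, 4-7 under another) is an assumed topology chosen to match the {0, 1} versus {0, 7} example.

```java
final class GpuBitmapSketch {
    /** Builds a mask with one bit set per GPU id (ids must be in 0..63). */
    static long maskOf(int... gpuIds) {
        long mask = 0L;
        for (int id : gpuIds) {
            if (id < 0 || id >= 64) {
                throw new IllegalArgumentException("GPU id out of range: " + id);
            }
            mask |= 1L << id;
        }
        return mask;
    }

    /** True if every requested GPU is marked available ('1') in the node bitmap. */
    static boolean canAllocate(long available, long request) {
        return (available & request) == request;
    }

    /** Clears the requested bits, marking those GPUs as in use. */
    static long allocate(long available, long request) {
        if (!canAllocate(available, request)) {
            throw new IllegalStateException("requested GPUs are not all available");
        }
        return available & ~request;
    }

    /** Sets the bits again when a container finishes. */
    static long release(long available, long request) {
        return available | request;
    }

    /** True if all requested GPUs fall inside a single (assumed) switch group. */
    static boolean fitsOneGroup(long request, long[] switchGroups) {
        for (long group : switchGroups) {
            if ((request & group) == request) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        long available = maskOf(0, 1, 2, 3, 4, 5, 6, 7); // 8 GPUs, all free
        long[] groups = { maskOf(0, 1, 2, 3), maskOf(4, 5, 6, 7) }; // assumed topology

        System.out.println(fitsOneGroup(maskOf(0, 1), groups)); // prints true
        System.out.println(fitsOneGroup(maskOf(0, 7), groups)); // prints false

        available = allocate(available, maskOf(0, 1));
        System.out.println(Long.toBinaryString(available)); // prints 11111100
        System.out.println(canAllocate(available, maskOf(0, 7))); // prints false
    }
}
```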
