hadoop-yarn-issues mailing list archives

From "Subru Krishnan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-6223) [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation on YARN
Date Thu, 23 Feb 2017 03:34:45 GMT

    [ https://issues.apache.org/jira/browse/YARN-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15879794#comment-15879794 ]

Subru Krishnan commented on YARN-6223:

Thanks [~leftnoteasy] for initiating this. I have a question regarding GPU scheduling:
are you planning to use YARN-3926?
We are planning to use it because:
  1. I feel the alternative approaches (YARN-4122/YARN-5517) are neither generic nor scalable.
  2. Moreover, YARN-3926 seems to be close to completion.

Thoughts? cc [~vvasudev], [~asuresh].

> [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation on YARN
> ------------------------------------------------------------------------------------
>                 Key: YARN-6223
>                 URL: https://issues.apache.org/jira/browse/YARN-6223
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Wangda Tan
>            Assignee: Wangda Tan
> A variety of workloads are moving to YARN, including machine learning / deep learning
workloads that can be sped up by leveraging GPU computation power. Workloads should be able
to request GPUs from YARN as simply as CPU and memory.
> *To make a complete GPU story, we should support the following pieces:*
> 1) GPU discovery/configuration: An admin can either configure GPU resources and architectures
on each node or, in a more advanced setup, the NodeManager can automatically discover GPU
resources and architectures and report them to the ResourceManager.
> 2) GPU scheduling: The YARN scheduler should account for GPU as a resource type just like CPU
and memory.
> 3) GPU isolation/monitoring: Once a task is launched with GPU resources, the NodeManager should
properly isolate and monitor the task's resource usage.
> For #2, YARN-3926 can support it natively. For #3, YARN-3611 has introduced an extensible
framework to support isolation for different resource types and different runtimes.
> *Related JIRAs:*
> There're a couple of JIRAs (YARN-4122/YARN-5517) filed with similar goals but different
> For scheduling:
> - YARN-4122/YARN-5517 both add a new GPU resource type to the Resource protocol instead
of leveraging YARN-3926.
> For isolation:
> - YARN-4122 proposed using CGroups for isolation, which cannot solve the problems
listed at https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation#challenges, such as minor
device number mapping, loading the nvidia_uvm module, and mismatched CUDA/driver versions.
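
The scheduling contrast in the description (a dedicated GPU field in the Resource protocol versus a generic resource type) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class `ResourceVector`, its methods, and the resource names "memory-mb", "vcores", and "gpu" are hypothetical and are not the YARN Resource API or the YARN-3926 design. The point is only that when resources are keyed by name, the fit check does not need to know that "gpu" exists.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of a name-keyed resource vector; not the YARN Resource API. */
public class ResourceVector {
    // Resource type name -> amount. New types need no new fields or methods.
    private final Map<String, Long> amounts = new LinkedHashMap<>();

    /** Sets the amount for a resource type and returns this vector for chaining. */
    public ResourceVector with(String type, long amount) {
        if (amount < 0) {
            throw new IllegalArgumentException("negative amount for " + type);
        }
        amounts.put(type, amount);
        return this;
    }

    /** Returns the amount for a type, or 0 if the type is not present. */
    public long get(String type) {
        return amounts.getOrDefault(type, 0L);
    }

    /** True if every type in the request is available in this vector in sufficient quantity. */
    public boolean fits(ResourceVector request) {
        for (Map.Entry<String, Long> e : request.amounts.entrySet()) {
            if (get(e.getKey()) < e.getValue()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Assumed node capacity; the names and numbers are illustrative only.
        ResourceVector node = new ResourceVector()
                .with("memory-mb", 8192).with("vcores", 8).with("gpu", 2);
        ResourceVector request = new ResourceVector()
                .with("memory-mb", 4096).with("vcores", 2).with("gpu", 1);
        System.out.println("request fits: " + node.fits(request));
    }
}
```

With a dedicated GPU field, by comparison, each additional resource type would require another field and another comparison in the fit check.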

This message was sent by Atlassian JIRA

