flink-issues mailing list archives

From "Dongwon Kim (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-8431) Allow to specify # GPUs for TaskManager in Mesos
Date Wed, 17 Jan 2018 07:23:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-8431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16328386#comment-16328386
] 

Dongwon Kim commented on FLINK-8431:
------------------------------------

[~eronwright] I'm testing my implementation by launching a standalone Flink cluster using
{{./bin/mesos-appmaster.sh}}. I tested the following scenarios with Mesos configured with
{{--filter_gpu_resources}}.
 * *When {{mesos.resourcemanager.tasks.gpus}} is not specified or is set to 0.0*
 ** {{LaunchCoordinator}} isn't given any offers because {{MesosFlinkResourceManager}} does
not enable the {{GPU_RESOURCES}} capability in this case.
 * *When {{mesos.resourcemanager.tasks.gpus}} is less than or equal to the number of GPUs
available on a node*
 ** Given offers, {{LaunchCoordinator}} aggregates the offers of different roles from the same
node and passes the aggregated offers to Fenzo, which schedules tasks across nodes. When Fenzo
reports a successful assignment, {{LaunchCoordinator}} allocates resources of different
roles to the tasks and populates a {{Protos.TaskInfo}} with the allocated resources, which is
then sent to the Mesos master.
 * *When {{mesos.resourcemanager.tasks.gpus}} is greater than the number of GPUs available
on a node*
 ** Given offers, {{LaunchCoordinator}} aggregates the offers of different roles from the same
node and passes the aggregated offers to Fenzo. However, Fenzo notifies {{LaunchCoordinator}}
that scheduling failed, with the following message:
     AssignmentFailure{resource=Other, asking=3.0, used=0.0, available=2.0, message=gpus}.
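
The three scenarios above can be sketched with a small, self-contained Java program. This is
not the Flink or Fenzo code: {{Offer}}, {{needsGpuCapability}}, {{aggregateGpusByHost}} and
{{hostsThatFit}} are hypothetical names, and the split of the two available GPUs over two
roles on one node is an assumption made only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class GpuOfferSketch {

    /** Hypothetical stand-in for a Mesos offer: one node, one role, some GPUs. */
    static final class Offer {
        final String hostname;
        final String role;
        final double gpus;

        Offer(String hostname, String role, double gpus) {
            this.hostname = hostname;
            this.role = role;
            this.gpus = gpus;
        }
    }

    /** Scenario 1: the GPU capability is only requested for a positive GPU count. */
    static boolean needsGpuCapability(double requestedGpus) {
        return requestedGpus > 0.0;
    }

    /** Sums the GPUs offered on each host, regardless of the role of each offer. */
    static Map<String, Double> aggregateGpusByHost(List<Offer> offers) {
        Map<String, Double> gpusByHost = new LinkedHashMap<>();
        for (Offer offer : offers) {
            gpusByHost.merge(offer.hostname, offer.gpus, Double::sum);
        }
        return gpusByHost;
    }

    /** Scenarios 2 and 3: a host fits only if its aggregated GPUs cover the request. */
    static List<String> hostsThatFit(Map<String, Double> gpusByHost, double requestedGpus) {
        List<String> hosts = new ArrayList<>();
        for (Map.Entry<String, Double> entry : gpusByHost.entrySet()) {
            if (entry.getValue() >= requestedGpus) {
                hosts.add(entry.getKey());
            }
        }
        return hosts;
    }

    public static void main(String[] args) {
        // Assumed setup: one node offering one GPU under each of two roles.
        List<Offer> offers = Arrays.asList(
                new Offer("node-1", "*", 1.0),
                new Offer("node-1", "flink", 1.0));
        Map<String, Double> gpusByHost = aggregateGpusByHost(offers);

        System.out.println(needsGpuCapability(0.0));        // false
        System.out.println(hostsThatFit(gpusByHost, 2.0));  // [node-1]
        System.out.println(hostsThatFit(gpusByHost, 3.0));  // []
    }
}
```

In the last call no host qualifies, which corresponds to the {{asking=3.0}}, {{available=2.0}}
failure reported by Fenzo.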

> Allow to specify # GPUs for TaskManager in Mesos
> ------------------------------------------------
>
>                 Key: FLINK-8431
>                 URL: https://issues.apache.org/jira/browse/FLINK-8431
>             Project: Flink
>          Issue Type: Improvement
>          Components: Cluster Management, Mesos
>            Reporter: Dongwon Kim
>            Assignee: Dongwon Kim
>            Priority: Minor
>
> Mesos provides first-class support for Nvidia GPUs [1], but Flink does not exploit it
> when scheduling TaskManagers. If Mesos agents are configured to isolate GPUs as shown in [2],
> TaskManagers that do not explicitly request GPUs cannot see any GPUs at all.
> We therefore need to introduce a new configuration property named "mesos.resourcemanager.tasks.gpus"
> to allow users to specify the number of GPUs for each TaskManager process in Mesos.
> [1] http://mesos.apache.org/documentation/latest/gpu-support/
> [2] http://mesos.apache.org/documentation/latest/gpu-support/#agent-flags



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
