hadoop-yarn-issues mailing list archives

From "Zhankun Tang (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (YARN-8851) [Umbrella] A new pluggable device plugin framework to ease vendor plugin development
Date Wed, 24 Oct 2018 03:13:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-8851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16661657#comment-16661657 ]

Zhankun Tang edited comment on YARN-8851 at 10/24/18 3:12 AM:
--------------------------------------------------------------

[~leftnoteasy], thanks for the review. Answers below:
{quote}1) From a user perspective, what needs to be implemented? Is it just following two?

DevicePlugin (required)
 DevicePluginScheduler (optional)
{quote}
{color:#d04437}Zhankun->{color} Yeah, just those two.
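A minimal sketch of what implementing only the required interface might look like. The method
signatures come from this thread; the fields of Device and DeviceRegisterRequest, the
FakeDevicePlugin class and the resource name are assumptions made up for illustration and may not
match the patch.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-in for the framework's Device type; the real fields may differ.
final class Device {
  final int id;
  final String devPath;

  Device(int id, String devPath) {
    this.id = id;
    this.devPath = devPath;
  }
}

// Hypothetical stand-in; the thread only says it carries the resource type name.
final class DeviceRegisterRequest {
  final String resourceName;

  DeviceRegisterRequest(String resourceName) {
    this.resourceName = resourceName;
  }
}

// Required: every vendor plugin implements this.
interface DevicePlugin {
  DeviceRegisterRequest register();

  Set<Device> getDevices();
}

// Optional: only needed when the vendor wants customized scheduling.
interface DevicePluginScheduler {
  Set<Device> allocateDevices(Set<Device> availableDevices, Integer count);
}

// A vendor plugin that implements only the required interface.
final class FakeDevicePlugin implements DevicePlugin {
  @Override
  public DeviceRegisterRequest register() {
    // Made-up resource type name.
    return new DeviceRegisterRequest("example.com/fake-device");
  }

  @Override
  public Set<Device> getDevices() {
    Set<Device> devices = new LinkedHashSet<>();
    devices.add(new Device(0, "/dev/fake0"));
    devices.add(new Device(1, "/dev/fake1"));
    return devices;
  }
}
```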
{quote}2) It's good to see you added an examples package; it will be useful for users to start
with. However, instead of providing a fake implementation, can we implement a demo device plugin
that can actually be configured and tested on a single-node cluster? This will give users a better
sense of how to implement their own plugin. Further, it would be good if you could provide a sanity
test suite to verify that a device plugin is compatible.
{quote}
{color:#d04437}Zhankun->{color} The fake device plugin can actually be configured and tested.
The only problem I see is that the example is just a class, not a Maven project
with its own pom.xml. Should we add the pom.xml dependencies to the document and to the example
device plugin's code comments?

For the sanity test suite, will do that.
{quote}3) Some high-level comments about the APIs in DevicePlugin

DeviceRegisterRequest register();
 This is a bit confusing. A register() function is normally a two-side call, e.g a slave registers
itself to a master. But here it simply returns a DeviceRegisterRequest, it looks more like
a getDeviceInfo() API to me.

Set<Device> getDevices();
 Is this supposed to return a set of available devices? If so, is it better to rename it to "getAvailableDevices"?
{quote}
{color:#d04437}Zhankun->{color} The DeviceRegisterRequest contains the name of the resource
type that the plugin wants to register, and maybe other info in the future. How about
"DeviceRegisterRequest getRegisterInfo()"?

Yeah. "getAvailableDevices" is more concrete. But I'm afraid that once we support monitoring the
devices, this method will be called regularly. The name is also a little confusing for a plugin
that has scheduling logic: what does "available" mean? Does the plugin need to count the devices
that are already in use? I guess what we are actually asking for is the allowed devices. How about
"Set<Device> getAllowedDevices()"?
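To make the allowed/available distinction concrete, here is a rough sketch assuming the two
proposed names are adopted. The plugin only reports the devices it is allowed to expose, and the
framework subtracts the devices already in use. DeviceBookkeeper, FixedDevicePlugin and the fields
of the stand-in types are hypothetical and not part of the patch.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-ins; the real types may carry more fields.
final class Device {
  final int id;

  Device(int id) {
    this.id = id;
  }
}

final class DeviceRegisterRequest {
  final String resourceName;

  DeviceRegisterRequest(String resourceName) {
    this.resourceName = resourceName;
  }
}

// DevicePlugin with the two renames proposed above.
interface DevicePlugin {
  DeviceRegisterRequest getRegisterInfo();

  // Every device the plugin may expose, whether or not it is currently in use.
  Set<Device> getAllowedDevices();
}

// Hypothetical plugin that reports a fixed set of devices.
final class FixedDevicePlugin implements DevicePlugin {
  private final String resourceName;
  private final Set<Device> allowed;

  FixedDevicePlugin(String resourceName, Set<Device> allowed) {
    this.resourceName = resourceName;
    this.allowed = allowed;
  }

  @Override
  public DeviceRegisterRequest getRegisterInfo() {
    return new DeviceRegisterRequest(resourceName);
  }

  @Override
  public Set<Device> getAllowedDevices() {
    return allowed;
  }
}

// Hypothetical framework-side helper: "available" is derived by the framework,
// so the plugin never has to track which devices are in use.
final class DeviceBookkeeper {
  private final Set<Device> inUse = new LinkedHashSet<>();

  void markInUse(Device device) {
    inUse.add(device);
  }

  Set<Device> availableFrom(DevicePlugin plugin) {
    Set<Device> available = new LinkedHashSet<>(plugin.getAllowedDevices());
    available.removeAll(inUse);
    return available;
  }
}
```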
{quote}4) It is interesting to allow a customized DevicePluginScheduler, but how can failure
recovery be done? Does that mean the user needs to implement all the logic for persisting and
recovering allocated resources in the NM store? In that case, we are exposing too much of YARN's
internals in a plugin framework.
{quote}
{color:#d04437}Zhankun->{color} YARN will do the bookkeeping, persistence and recovery
for all allocations made by a customized device plugin scheduler. The DevicePluginScheduler should
be stateless. Check the API description below; we ensure that the "availableDevices" set passed
into the API is immutable, so calling the API won't affect YARN stability.

Here we ask the plugin this question: "hey, there are some available devices on hand, choose
N for me".

The vendor plugin developer can inspect the set and do customized scheduling based on topology,
utilization, virtualization or health status, using logic of their own that we don't know about.
{code:java}
/**
 * Called when allocating devices. The framework does all device bookkeeping
 * and failure recovery, so this hook should only do scheduling based on the
 * available devices passed in. This method could be invoked multiple times.
 * @param availableDevices Devices allowed to be chosen from.
 * @param count Number of devices to be allocated.
 * @return a set of {@link Device}
 */
Set<Device> allocateDevices(Set<Device> availableDevices, Integer count);{code}
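As a rough illustration of the contract above, the sketch below pairs a stateless scheduler with
a framework-side caller that owns the bookkeeping. FirstFitScheduler, AllocationDriver and the
validation of the returned set are assumptions for illustration and are not taken from the patch;
only the allocateDevices signature comes from this thread.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-in for the framework's Device type.
final class Device {
  final int id;

  Device(int id) {
    this.id = id;
  }
}

interface DevicePluginScheduler {
  Set<Device> allocateDevices(Set<Device> availableDevices, Integer count);
}

// Stateless scheduler: keeps no fields and simply picks the first `count` devices.
// A vendor would replace the loop with topology- or health-aware logic.
final class FirstFitScheduler implements DevicePluginScheduler {
  @Override
  public Set<Device> allocateDevices(Set<Device> availableDevices, Integer count) {
    Set<Device> chosen = new LinkedHashSet<>();
    for (Device device : availableDevices) {
      if (chosen.size() >= count) {
        break;
      }
      chosen.add(device);
    }
    return chosen;
  }
}

// Hypothetical framework-side caller that owns all allocation state.
final class AllocationDriver {
  private final Set<Device> allocated = new LinkedHashSet<>();

  Set<Device> allocate(DevicePluginScheduler scheduler, Set<Device> allowed, int count) {
    Set<Device> available = new LinkedHashSet<>(allowed);
    available.removeAll(allocated);
    // The plugin only ever sees a read-only view of the available devices.
    Set<Device> chosen =
        scheduler.allocateDevices(Collections.unmodifiableSet(available), count);
    // Reject answers that are too small, too large, or contain unknown devices.
    if (chosen == null || chosen.size() != count || !available.containsAll(chosen)) {
      throw new IllegalStateException("Scheduler returned an invalid allocation");
    }
    allocated.addAll(chosen);
    return chosen;
  }
}
```

Because the scheduler holds no state, nothing of the plugin's has to be persisted or recovered
after a restart; in this sketch only the caller's allocated set would need to be stored.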
{quote}5) DevicePluginAdapter doesn't look like an adapter; it looks more like a base class
of ResourcePlugin to me. Please correct me if I misunderstood this.
{quote}
{color:#d04437}Zhankun->{color} I'm afraid not. Each device plugin instance is wrapped in one
DevicePluginAdapter so that it can be integrated into YARN's ResourcePlugin handling process. From
this angle, the DevicePluginAdapter adapts YARN's requirements to the plugin instance.

I haven't got a better name for it. The previous implementation of DevicePluginAdapter
implemented 4 interfaces; now it only implements ResourcePlugin. How about "DeviceResourceImpl"?
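The wrapping relationship can be sketched roughly as follows. The ResourcePlugin here is a
simplified stand-in with made-up methods (initialize and getTotalDeviceCount), not the real YARN
interface, and DevicePlugin is cut down to the single method the sketch needs.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-in for the framework's Device type.
final class Device {
  final int id;

  Device(int id) {
    this.id = id;
  }
}

// Vendor-facing interface, reduced to the one method this sketch needs.
interface DevicePlugin {
  Set<Device> getDevices();
}

// Simplified stand-in for YARN's ResourcePlugin. The real interface has
// different methods; these two are made up for illustration.
interface ResourcePlugin {
  void initialize();

  int getTotalDeviceCount();
}

// Wraps exactly one DevicePlugin and exposes it through ResourcePlugin,
// so the rest of the NM only ever deals with ResourcePlugin.
final class DevicePluginAdapter implements ResourcePlugin {
  private final DevicePlugin devicePlugin;
  private Set<Device> devices = new LinkedHashSet<>();

  DevicePluginAdapter(DevicePlugin devicePlugin) {
    this.devicePlugin = devicePlugin;
  }

  @Override
  public void initialize() {
    // Translate the NM-side call into a call on the vendor plugin.
    devices = new LinkedHashSet<>(devicePlugin.getDevices());
  }

  @Override
  public int getTotalDeviceCount() {
    return devices.size();
  }
}
```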
{quote}6) It is confusing that DevicePluginAdapter has a reference to ResourcePluginManager,
could you remove that? From what I can see, ResourcePluginManager manages all ResourcePlugins,
and each ResourcePlugins can be instanced by a DevicePluginAdapter.
{quote}
{color:#d04437}Zhankun->{color} Yeah, it's a leftover from the WIP patch. Will remove that. One
thing to clarify is that the DevicePluginAdapter itself is actually a ResourcePlugin. It is
added to ResourcePluginManager's pluginMap.
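
The resulting one-way relationship might look roughly like this: the manager holds the adapters
in its pluginMap, and the adapter keeps no reference back to the manager. All types are simplified
stand-ins; keying the map by resource name and the addPlugin/getPlugin/getResourceName methods are
assumptions for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-in for YARN's ResourcePlugin; getResourceName is made up.
interface ResourcePlugin {
  String getResourceName();
}

// The adapter is itself a ResourcePlugin and holds no reference to the manager.
final class DevicePluginAdapter implements ResourcePlugin {
  private final String resourceName;

  DevicePluginAdapter(String resourceName) {
    this.resourceName = resourceName;
  }

  @Override
  public String getResourceName() {
    return resourceName;
  }
}

// Simplified stand-in for the manager: it owns the map of all resource plugins.
final class ResourcePluginManager {
  private final Map<String, ResourcePlugin> pluginMap = new LinkedHashMap<>();

  void addPlugin(ResourcePlugin plugin) {
    pluginMap.put(plugin.getResourceName(), plugin);
  }

  ResourcePlugin getPlugin(String resourceName) {
    return pluginMap.get(resourceName);
  }
}
```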



> [Umbrella] A new pluggable device plugin framework to ease vendor plugin development
> ------------------------------------------------------------------------------------
>
>                 Key: YARN-8851
>                 URL: https://issues.apache.org/jira/browse/YARN-8851
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: yarn
>            Reporter: Zhankun Tang
>            Assignee: Zhankun Tang
>            Priority: Major
>         Attachments: YARN-8851-WIP2-trunk.001.patch, YARN-8851-WIP3-trunk.001.patch,
YARN-8851-WIP4-trunk.001.patch, YARN-8851-WIP5-trunk.001.patch, YARN-8851-WIP6-trunk.001.patch,
YARN-8851-WIP7-trunk.001.patch, [YARN-8851] YARN_New_Device_Plugin_Framework_Design_Proposal-3.pdf,
[YARN-8851] YARN_New_Device_Plugin_Framework_Design_Proposal.pdf
>
>
> At present, we support GPU/FPGA devices in YARN in a native, tightly coupled way. But it's
difficult for a vendor to implement such a device plugin because the developer needs much
knowledge of YARN internals. This also burdens the community with maintaining both YARN
core and vendor-specific code.
> Here we propose a new device plugin framework to ease vendor device plugin development
and provide a more flexible way to integrate with YARN NM.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org

