hadoop-yarn-issues mailing list archives

From "Arun C Murthy (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1404) Enable external systems/frameworks to share resources with Hadoop leveraging Yarn resource scheduling
Date Tue, 10 Dec 2013 15:32:09 GMT

    [ https://issues.apache.org/jira/browse/YARN-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844344#comment-13844344 ]

Arun C Murthy commented on YARN-1404:
-------------------------------------

I've spent time thinking about this in the context of running a myriad of external systems on YARN, such as Impala, HDFS Caching (HDFS-4949), and others.

The overarching goal is to allow YARN to act as a ResourceManager for the overall cluster *and* as a Workload Manager for external systems, i.e. Impala or HDFS can then rely on YARN's queues for workload management, SLAs via preemption, etc.

Is that a good characterization of the problem at hand?

I think it's a good goal to support - this will allow other external systems to leverage YARN's
capabilities for both resource sharing and workload management.

Now, if we all agree on this, we can figure out the best way to support it in a first-class manner.

----

Ok, the core requirement is for an external system (Impala, HDFS, others) to leverage YARN's
workload management capabilities (queues etc.) to acquire resources (cpu, memory) *on behalf*
of a particular entity (user, queue) for completing a user's request (run a query, cache a
dataset in RAM). 

The *key* is that these external systems need to acquire resources on behalf of the user and
ensure that the chargeback is applied to the correct user, queue etc.

This is a *brand new requirement* for YARN... so far, we have assumed that the entity acquiring the resource would also actually utilize it by launching a container, etc.

Here, it's clear that the entity acquiring the resource would like to *delegate* it to an external framework. For example:
# A user query would like to acquire cpu, memory etc. for appropriate accounting chargeback
and then delegate it to Impala.
# A user request for caching data would like to acquire memory for appropriate accounting chargeback and then delegate it to the Datanode.

----

In this scenario, I think explicitly allowing for *delegation* of a container would solve
the problem in a first-class manner.

We should add a new API to the NodeManager which would allow an application to *delegate*
a container's resources to a different container:

{code:title=ContainerManagementProtocol.java|borderStyle=solid}
public interface ContainerManagementProtocol {
  // ...
  /** Delegate the resources of a source container to another running container on this node. */
  public DelegateContainerResponse delegateContainer(DelegateContainerRequest request);
  // ...
}
{code}

{code:title=DelegateContainerRequest.java|borderStyle=solid}
public abstract class DelegateContainerRequest {
  // ...
  /** The launch context of the container whose resources are being delegated. */
  public abstract ContainerLaunchContext getSourceContainer();

  /** The container that receives the delegated resources. */
  public abstract ContainerId getTargetContainer();
  // ...
}
{code}
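
For illustration, a caller-side invocation might look like the sketch below. The {{newInstance}} factory and the surrounding class are assumptions by analogy with YARN's other protocol records; the proposal above defines only the abstract accessors.

{code:title=DelegationClientSketch.java|borderStyle=solid}
// Hypothetical caller-side sketch of the proposed API. The
// DelegateContainer* types do not exist yet; newInstance(...) is
// assumed by analogy with YARN's other protocol records.
import java.io.IOException;

import org.apache.hadoop.yarn.api.ContainerManagementProtocol;
import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.exceptions.YarnException;

public class DelegationClientSketch {

  /** Delegate an acquired allocation to a long-running recipient (e.g. an impalad). */
  public DelegateContainerResponse delegateTo(
      ContainerManagementProtocol cm,
      ContainerLaunchContext source,   // the allocation being given away
      ContainerId recipient)           // the external framework's container
      throws YarnException, IOException {
    DelegateContainerRequest request =
        DelegateContainerRequest.newInstance(source, recipient);
    return cm.delegateContainer(request);
  }
}
{code}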


The implementation of this API would have the NodeManager change its monitoring of the recipient container (i.e. Impala or the Datanode) by modifying the recipient container's cgroup.

Similarly, the ResourceManager could instruct the NodeManager to preempt the resources of the source container in order to continue serving the global SLAs of the queues; again, this is implemented by modifying the cgroup of the recipient container. This allows the ResourceManager/NodeManager to remain explicitly in control of resources, even in the face of misbehaving AMs, etc.
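
To make the cgroup mechanics concrete, here is a minimal sketch of the NodeManager-side bookkeeping. Everything in it is assumed for illustration: the cgroup v1 paths, the class and helper names, and the naive read-add-write accounting. A real implementation would presumably live behind the NodeManager's existing container-executor resource handling, and would also cover the reverse (preemption) path and partial failures.

{code:title=DelegationCgroupSketch.java|borderStyle=solid}
// Illustrative only: paths, names, and accounting are assumptions,
// not existing NodeManager code.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DelegationCgroupSketch {

  // Assumed cgroup v1 hierarchy roots; a real NodeManager would read
  // these from its container-executor configuration.
  private static final String CPU_ROOT = "/sys/fs/cgroup/cpu/hadoop-yarn";
  private static final String MEM_ROOT = "/sys/fs/cgroup/memory/hadoop-yarn";

  // Kernel minimum for cpu.shares under the cgroup v1 cpu controller.
  private static final long MIN_CPU_SHARES = 2;

  /**
   * Move the source container's cpu shares and memory limit onto the
   * recipient container's cgroup, then shrink the source so it can no
   * longer consume what it delegated away.
   */
  public void delegate(String sourceId, String recipientId) throws IOException {
    long shares = readLong(Paths.get(CPU_ROOT, sourceId, "cpu.shares"));
    long bytes = readLong(Paths.get(MEM_ROOT, sourceId, "memory.limit_in_bytes"));

    // Grow the recipient's cgroup by the delegated amounts.
    write(Paths.get(CPU_ROOT, recipientId, "cpu.shares"),
        readLong(Paths.get(CPU_ROOT, recipientId, "cpu.shares")) + shares);
    write(Paths.get(MEM_ROOT, recipientId, "memory.limit_in_bytes"),
        readLong(Paths.get(MEM_ROOT, recipientId, "memory.limit_in_bytes")) + bytes);

    // Shrink the source's cpu weight to the kernel minimum. (Shrinking a
    // live memory cgroup needs more care than a sketch can show.)
    write(Paths.get(CPU_ROOT, sourceId, "cpu.shares"), MIN_CPU_SHARES);
  }

  private long readLong(Path p) throws IOException {
    return Long.parseLong(new String(Files.readAllBytes(p), StandardCharsets.UTF_8).trim());
  }

  private void write(Path p, long value) throws IOException {
    Files.write(p, Long.toString(value).getBytes(StandardCharsets.UTF_8));
  }
}
{code}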

----

The result of the above proposal is very similar to what is already being discussed; the only difference is that this is explicit (the NodeManager knows both the source and the recipient containers), and it allows all existing features, such as preemption and over-allocation of resources to YARN queues, to continue to work as they do today.

----

Thoughts?

> Enable external systems/frameworks to share resources with Hadoop leveraging Yarn resource scheduling
> -----------------------------------------------------------------------------------------------------
>
>                 Key: YARN-1404
>                 URL: https://issues.apache.org/jira/browse/YARN-1404
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: nodemanager
>    Affects Versions: 2.2.0
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>         Attachments: YARN-1404.patch
>
>
> Currently Hadoop Yarn expects to manage the lifecycle of the processes its applications run their workload in. External frameworks/systems could benefit from sharing resources with other Yarn applications while running their workload within long-running processes owned by the external framework (in other words, running their workload outside of the context of a Yarn container process).
> Because Yarn provides robust and scalable resource management, it is desirable for some external systems to leverage the resource governance capabilities of Yarn (queues, capacities, scheduling, access control) while supplying their own resource enforcement.
> Impala is an example of such a system. Impala uses Llama (http://cloudera.github.io/llama/) to request resources from Yarn.
> Impala runs an impalad process on every node of the cluster. When a user submits a query, the processing is broken into 'query fragments' which are run in multiple impalad processes, leveraging data locality (similar to Map-Reduce Mappers processing a collocated HDFS block of input data).
> The execution of a 'query fragment' requires an amount of CPU and memory in the impalad, which shares its host with other services (HDFS DataNode, Yarn NodeManager, HBase Region Server) and Yarn applications (MapReduce tasks).
> To ensure that cluster utilization follows the Yarn scheduler policies and does not overload the cluster nodes, Impala requests the required amount of CPU and memory from Yarn before running a 'query fragment' on a node. Once the requested CPU and memory have been allocated, Impala runs the 'query fragment', taking care that it does not use more resources than have been allocated. Memory is bookkept per 'query fragment', and the threads processing the 'query fragment' are placed under a cgroup to contain CPU utilization.
> Today, for all resources requested from the Yarn RM, a (container) process must be started via the corresponding NodeManager. Failing to do so results in the cancellation of the container allocation, relinquishing the acquired resource capacity back to the pool of available resources. To avoid this, Impala starts a dummy container process that runs 'sleep 10y'.
> Using a dummy container process has its drawbacks:
> * The dummy container process sits in a cgroup with a given number of CPU shares that it never uses, and Impala re-issues those CPU shares to another cgroup for the threads running the 'query fragment'. The cgroup CPU enforcement happens to work correctly because of the CPU controller implementation (but the formally specified behavior is actually undefined).
> * Impala may ask for CPU and memory independently of each other; some requests may be memory-only with no CPU, or vice versa. Because a container requires a process, a complete absence of memory or CPU is not possible: even if the dummy process is just 'sleep', a minimal amount of memory and CPU is still required for it.
> Because of this, it is desirable to be able to have a container without a backing process.



