hadoop-yarn-issues mailing list archives

From "Bikas Saha (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container
Date Thu, 25 Feb 2016 03:01:18 GMT

    [ https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15166633#comment-15166633 ]

Bikas Saha commented on YARN-1040:

I am sorry if I caused a digression by mentioning Slider etc.

I am not sure the upgrade scenario is the only one for this jira, since this jira covers a
broader set. Even without upgrades, apps can change the processes they are running in a container
without having to lose the container allocation. The same primitive calls could be used
without any notion of upgrade. E.g. start a Java process first for a Java task, then launch
a Python process for a Python task. To the NM this is identical to starting v1 and then starting
v2. So while it makes sense for the v1/v2 case to use an API called upgrade, it may not for
the Java/Python case.

(Unrelated to this jira, IMO, YARN should allow upgrade of app code without losing containers
but not necessarily understand it deeply. E.g. YARN need not assume that an upgrade will need
additional resources or try to acquire them transparently for the application.)

For the purpose of this jira, here is what I had in mind when I opened YARN-1292 to delink
process lifecycle from the container.
1) new API - acquireContainer - means ask for the allocated resource. The API has a flag to
specify whether process exit implies releaseContainer. The flag defaults to true for backwards
compatibility. Apps that want to keep that behavior can explicitly pass true when using the
new API; this is mainly for reducing the number of RPCs for apps like MR/Tez.
2) new API - startProcess - means start the remote process
3) new API - stopProcess - means stop the remote process
4) new API - releaseContainer - means release the allocated resource
5) Potentially a new API for localization, though in theory, this could be separate.
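
As a rough illustration of primitives 1) to 4), here is a minimal in-memory sketch in Java. It is not an existing YARN API: the class name, method signatures and state names are all hypothetical, and it assumes that stopProcess counts as a process exit, that releaseContainer is idempotent, and that only one process runs at a time. Localization (item 5) is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory model of the proposed primitives; not an actual YARN API. */
class ContainerLifecycleSketch {
    enum State { ALLOCATED, ACQUIRED, RELEASED }

    private State state = State.ALLOCATED;
    private boolean releaseOnProcessExit = true; // proposed default, for backwards compatibility
    private String runningProcess = null;
    private final List<String> launched = new ArrayList<>();

    State state() { return state; }
    List<String> launched() { return launched; }

    /** 1) Claim the allocated resource and record the process-exit policy. */
    void acquireContainer(boolean releaseOnProcessExit) {
        require(state == State.ALLOCATED, "container already acquired or released");
        this.releaseOnProcessExit = releaseOnProcessExit;
        state = State.ACQUIRED;
    }

    /** 2) Start a process inside the acquired container (one at a time in this sketch). */
    void startProcess(String command) {
        require(state == State.ACQUIRED, "container is not acquired");
        require(runningProcess == null, "a process is already running");
        runningProcess = command;
        launched.add(command);
    }

    /** 3) Stop the running process; assumed here to count as a process exit. */
    void stopProcess() {
        require(runningProcess != null, "no process is running");
        processExited();
    }

    /** Models the process ending, whether stopped or exiting on its own. */
    void processExited() {
        runningProcess = null;
        if (releaseOnProcessExit) {
            releaseContainer();
        }
    }

    /** 4) Give the resource back; idempotent in this sketch (an assumption). */
    void releaseContainer() {
        runningProcess = null;
        state = State.RELEASED;
    }

    private static void require(boolean condition, String message) {
        if (!condition) throw new IllegalStateException(message);
    }

    public static void main(String[] args) {
        ContainerLifecycleSketch c = new ContainerLifecycleSketch();
        c.acquireContainer(false);      // keep the container across process exits
        c.startProcess("java-task");
        c.stopProcess();
        c.startProcess("python-task");  // same container, different process
        c.stopProcess();
        c.releaseContainer();
        System.out.println(c.launched() + " " + c.state());
    }
}
```

With the flag set to false the container survives each process exit, so the Java-then-Python sequence from earlier in this comment reuses one allocation. With the flag set to true the first exit releases the container, which matches today's behavior.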

Since this fine grained control makes the protocol chatty, we can reduce the RPC traffic by
having a new NM RPC, say NMCommand, that takes a sequence of API primitives that can be sent
in 1 RPC.
So the current API of startContainer effectively becomes NMCommand(1, 2) and stopContainer
becomes NMCommand(3, 4), where the numbers refer to the primitives listed above. This can be
leveraged for backwards compatibility and rolling upgrades.
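
To make the batching idea concrete, here is a small Java sketch of how one RPC could carry an ordered sequence of primitives, with the existing calls expressed as fixed sequences. Everything in it is hypothetical: `FakeNodeManager`, the `Primitive` enum and the `nmCommand` signature are illustrative stand-ins, not real NodeManager classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the NMCommand batching idea; not an actual YARN API. */
class NMCommandSketch {
    /** Primitives 1) to 4) from the proposal, in the same order. */
    enum Primitive { ACQUIRE_CONTAINER, START_PROCESS, STOP_PROCESS, RELEASE_CONTAINER }

    /** Stand-in for an NM endpoint: counts RPCs and logs the primitives it applied. */
    static class FakeNodeManager {
        int rpcCount = 0;
        final List<Primitive> applied = new ArrayList<>();

        /** One call models one RPC carrying an ordered sequence of primitives. */
        void nmCommand(Primitive... sequence) {
            rpcCount++;
            for (Primitive p : sequence) {
                applied.add(p);
            }
        }

        /** Legacy startContainer expressed as NMCommand(1, 2). */
        void startContainer() {
            nmCommand(Primitive.ACQUIRE_CONTAINER, Primitive.START_PROCESS);
        }

        /** Legacy stopContainer expressed as NMCommand(3, 4). */
        void stopContainer() {
            nmCommand(Primitive.STOP_PROCESS, Primitive.RELEASE_CONTAINER);
        }
    }

    public static void main(String[] args) {
        FakeNodeManager nm = new FakeNodeManager();
        nm.startContainer();
        nm.stopContainer();
        System.out.println(nm.rpcCount + " RPCs, applied " + nm.applied);
    }
}
```

In this sketch the legacy calls stay at one RPC each while still mapping onto the finer-grained primitives, and an app that wants fine-grained control can send any other sequence through the same `nmCommand` entry point.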

The above items would effectively delink process and container lifecycle and close out this jira.

This provides the fine grained control in core YARN that can be used for various scenarios
e.g. upgrades without YARN understanding the scenarios. If we need to add higher level notions
for upgrades etc. then those could be done as separate items.

I hope that helps make my thoughts concrete within the scope of this jira.

> De-link container life cycle from the process and add ability to execute multiple processes
> in the same long-lived container
> ----------------------------------------------------------------------------------------------------------------------------
>                 Key: YARN-1040
>                 URL: https://issues.apache.org/jira/browse/YARN-1040
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 3.0.0
>            Reporter: Steve Loughran
> The AM should be able to exec >1 process in a container, rather than have the NM automatically
> release the container when the single process exits.
> This would let an AM restart a process on the same container repeatedly, which for HBase
> would offer locality on a restarted region server.
> We may also want the ability to exec multiple processes in parallel, so that something
> could be run in the container while a long-lived process was already running. This can be
> useful in monitoring and reconfiguring the long-lived process, as well as shutting it down.

This message was sent by Atlassian JIRA
