hadoop-yarn-issues mailing list archives

From "Varun Vasudev (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4876) [Phase 1] Decoupled Init / Destroy of Containers from Start / Stop
Date Thu, 07 Apr 2016 14:52:25 GMT

    [ https://issues.apache.org/jira/browse/YARN-4876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230349#comment-15230349 ]

Varun Vasudev commented on YARN-4876:
-------------------------------------

Thanks for the document [~asuresh]!

Here are my initial thoughts -

{code} Add int field 'destroyDelay' to each 'StartContainerRequest':{code}

I think we should avoid this for now. We should require that AMs that use initialize() must
call destroy, and that AMs that call start with the ContainerLaunchContext can't call destroy. We
can achieve that by adding the destroyDelay field you mentioned in your document without
allowing AMs to set it: if initialize is called, set destroyDelay internally to -1, otherwise to
0. I'm not saying we should drop the feature, just that we should come back to it once we've
sorted out the lifecycle from an initialize->destroy perspective.
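
To make the rule above concrete, here is a minimal sketch. It assumes that -1 means "keep the container until the AM calls destroy" and 0 means "destroy as soon as the container stops"; those meanings, and the names `CreationPath` and `DestroyDelayPolicy`, are illustrative assumptions and not existing YARN types.

```java
// Hypothetical sketch; none of these types exist in YARN.
enum CreationPath { INITIALIZE, START_WITH_LAUNCH_CONTEXT }

final class DestroyDelayPolicy {
    // Assumed meaning: the container is kept until the AM calls destroy explicitly.
    static final int WAIT_FOR_EXPLICIT_DESTROY = -1;
    // Assumed meaning: the container is destroyed as soon as it stops.
    static final int DESTROY_ON_STOP = 0;

    private DestroyDelayPolicy() {}

    // The value is derived internally from how the container was created;
    // the AM never supplies it.
    static int destroyDelayFor(CreationPath path) {
        return path == CreationPath.INITIALIZE
                ? WAIT_FOR_EXPLICIT_DESTROY
                : DESTROY_ON_STOP;
    }

    // Only containers created through initialize() may (and must) be destroyed by the AM.
    static boolean mayCallDestroy(CreationPath path) {
        return destroyDelayFor(path) == WAIT_FOR_EXPLICIT_DESTROY;
    }
}
```

Keeping the field internal means the two lifecycles (initialize/destroy and start/stop) cannot be mixed by an AM, which is the point of deferring the user-settable delay.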

{code}
Modify 'StopContainerRequest' Record:
  Add boolean 'destroyContainer':
{code}
Similar to above - let's avoid mixing initialize/destroy with start/stop for now.

{code}
• Introduce a new 'ContainerEventType.START_CONTAINER' event type.
• Introduce a new 'ContainerEventType.DESTROY_CONTAINER' event type.
• The Container remains in the LOCALIZED state until it receives the 'START_CONTAINER' event.
{code}

Can you add a state machine transition diagram to explain how new states and events affect
each other?
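
As a starting point for that diagram, the bullets quoted above could be read as the transition table below. This is only a sketch: the state and event sets are simplified, `SketchState`, `SketchEvent` and `ContainerLifecycleSketch` are hypothetical names, and the RUNNING -> LOCALIZED and LOCALIZED -> DONE edges are assumptions taken from the design doc's wording rather than from existing NodeManager code.

```java
// Hypothetical, simplified lifecycle; not the NodeManager's actual state machine.
enum SketchState { NEW, LOCALIZING, LOCALIZED, RUNNING, DONE }

enum SketchEvent {
    INIT_CONTAINER, RESOURCE_LOCALIZED, START_CONTAINER, KILL_CONTAINER, DESTROY_CONTAINER
}

final class ContainerLifecycleSketch {
    private ContainerLifecycleSketch() {}

    // Returns the next state, or throws if the event is not valid in the current state.
    static SketchState next(SketchState state, SketchEvent event) {
        switch (state) {
            case NEW:
                if (event == SketchEvent.INIT_CONTAINER) return SketchState.LOCALIZING;
                break;
            case LOCALIZING:
                if (event == SketchEvent.RESOURCE_LOCALIZED) return SketchState.LOCALIZED;
                break;
            case LOCALIZED:
                // The container waits here until the AM explicitly starts or destroys it.
                if (event == SketchEvent.START_CONTAINER) return SketchState.RUNNING;
                if (event == SketchEvent.DESTROY_CONTAINER) return SketchState.DONE;
                break;
            case RUNNING:
                // Assumption: killing the process keeps the localized resources.
                if (event == SketchEvent.KILL_CONTAINER) return SketchState.LOCALIZED;
                break;
            default:
                break;
        }
        throw new IllegalStateException("No transition from " + state + " on " + event);
    }
}
```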

{code}
If 'initializeContainer' with a new ContainerLaunchContext is called by the AM while the Container
is RUNNING, it is treated as a KILL_CONTAINER event followed by a CONTAINER_RESOURCE_CLEANUP
and an INIT_CONTAINER event to kick off re-localization after which the Container will return
to LOCALIZED state.
{code}
I'd really like to avoid this specific behavior. I think we should add an explicit re-initialize
API. For a running process, ideally, we want to localize the upgraded bits while the container
is running and only then kill the existing process, to minimize the downtime. For containers where
localization can take a long time, forcing a kill and then a re-initialize adds a serious
amount of downtime. Re-initialize and initialize will probably end up having different behaviors.
On a similar note, I think we might have to introduce a new state ("re-initializing", "re-localizing"
or "running-localizing") which implies that a container is running but we are carrying out some
background work.
In addition, I don't think we can do a cleanup of resources during an upgrade. For services
that have local state in the container work dir, we'd essentially be wiping away all the local
state and forcing them to start from scratch.
Just a clarification: when you mentioned CONTAINER_RESOURCE_CLEANUP, I'm assuming you meant
CLEANUP_CONTAINER_RESOURCES.
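
A rough sketch of the ordering argued for above: localize in the background first, kill the old process only once the new bits are ready, and leave the work dir untouched. The class `ReInitializeSketch`, its method names, and the state names `RUNNING_LOCALIZING` and `RESTARTING` are all hypothetical; nothing here is existing NodeManager code.

```java
// Hypothetical sketch of an explicit re-initialize flow; not existing NodeManager code.
final class ReInitializeSketch {
    // RUNNING_LOCALIZING and RESTARTING are assumed names for the proposed new states.
    enum State { RUNNING, RUNNING_LOCALIZING, RESTARTING }

    private State state = State.RUNNING;
    private String activeVersion;   // bits the live process was launched from
    private String pendingVersion;  // bits being localized in the background, or null

    ReInitializeSketch(String initialVersion) {
        this.activeVersion = initialVersion;
    }

    // AM asks for an upgrade: localization starts, the old process keeps running.
    void reInitialize(String newVersion) {
        require(State.RUNNING);
        pendingVersion = newVersion;
        state = State.RUNNING_LOCALIZING;
    }

    // Localization finished: only now is the old process killed. The work dir is
    // deliberately left alone so local service state survives the upgrade.
    void onUpgradeLocalized() {
        require(State.RUNNING_LOCALIZING);
        state = State.RESTARTING;
    }

    // The process has been relaunched from the upgraded bits.
    void onProcessRelaunched() {
        require(State.RESTARTING);
        activeVersion = pendingVersion;
        pendingVersion = null;
        state = State.RUNNING;
    }

    // Downtime is limited to the RESTARTING window.
    boolean isProcessUp() { return state != State.RESTARTING; }

    State state() { return state; }

    String activeVersion() { return activeVersion; }

    private void require(State expected) {
        if (state != expected) {
            throw new IllegalStateException("Expected " + expected + " but was " + state);
        }
    }
}
```

With this ordering, a slow localization lengthens the time spent in the background state but not the time the process is down.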

{code}
• If 'initializeContainer' is called WITHOUT a new ContainerLaunchContext by the AM, it is
considered a restart, and will follow the same code path as 'initializeContainer' with new
ContainerLaunchContext, but will not perform a CONTAINER_RESOURCE_CLEANUP and INIT_CONTAINER.
The Container process will be killed and the container will be returned to LOCALIZED state.
• If 'startContainer' is called WITHOUT a new ContainerLaunchContext by the AM, it is treated
exactly as the above case, but it will also trigger a START_CONTAINER event.
{code}
Instead of forcing AMs to make two calls, why don't we just add a restart API that does everything
you've outlined above? It's cleaner and we don't have to do as many condition checks. In addition,
with a restart API we can do things like allow AMs to specify a delay, or conditions under
which the restart should happen.
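
For illustration, a restart request along these lines might look like the sketch below. `RestartContainerRequestSketch`, its fields, and the two-event sequence are assumptions made for this example; no such record exists in the ContainerManagementProtocol.

```java
// Hypothetical request type for a single-call restart; not part of the YARN API.
final class RestartContainerRequestSketch {
    private final String containerId;
    private final long delayMillis;

    RestartContainerRequestSketch(String containerId, long delayMillis) {
        if (containerId == null || containerId.isEmpty()) {
            throw new IllegalArgumentException("containerId is required");
        }
        if (delayMillis < 0) {
            throw new IllegalArgumentException("delayMillis must be >= 0");
        }
        this.containerId = containerId;
        this.delayMillis = delayMillis;
    }

    String containerId() { return containerId; }

    long delayMillis() { return delayMillis; }

    // Earliest time the NM would act on the request, given when it was received.
    long restartAtMillis(long receivedAtMillis) {
        return receivedAtMillis + delayMillis;
    }

    // Assumed event sequence behind one restart call: kill the process (the container
    // goes back to LOCALIZED, with no cleanup or re-localization), then start it again.
    String[] impliedEvents() {
        return new String[] {"KILL_CONTAINER", "START_CONTAINER"};
    }
}
```

The AM makes one call, and the kill-then-start sequencing and any delay handling stay on the NM side.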

> [Phase 1] Decoupled Init / Destroy of Containers from Start / Stop
> ------------------------------------------------------------------
>
>                 Key: YARN-4876
>                 URL: https://issues.apache.org/jira/browse/YARN-4876
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Arun Suresh
>            Assignee: Arun Suresh
>         Attachments: YARN-4876-design-doc.pdf
>
>
> Introduce *initialize* and *destroy* container API into the *ContainerManagementProtocol*
> and decouple the actual start of a container from the initialization. This will allow AMs
> to re-start a container without having to lose the allocation.
> Additionally, if the localization of the container is associated with the initialize (and
> the cleanup with the destroy), this can also be used by applications to upgrade a Container
> by *re-initializing* with a new *ContainerLaunchContext*.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
