hadoop-yarn-issues mailing list archives

From "Arun Suresh (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4876) [Phase 1] Decoupled Init / Destroy of Containers from Start / Stop
Date Fri, 08 Apr 2016 17:32:25 GMT

    [ https://issues.apache.org/jira/browse/YARN-4876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232530#comment-15232530 ]

Arun Suresh commented on YARN-4876:
-----------------------------------

Thanks for the feedback, [~vvasudev].

bq. We can achieve that by adding the destroyDelay field you mentioned in your document but
don't allow AMs to set it. If initialize is called, set destroyDelay internally to -1, else
to 0.
I tend to agree with you, but my intention was to introduce a timeout after which, if no action
is taken by the AM, the Container is killed. Maybe we can have a default timeout (5 mins?)
and allow AMs to override it.
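
To make the timeout idea concrete, here is a rough sketch in plain Java. The class and method names (`DestroyTimeoutPolicy`, `effectiveTimeoutMs`, `shouldKill`) are made up for illustration and are not existing YARN APIs. The 5 minute default is just the value floated above, and treating a missing or non-positive override as "use the default" is an assumption.

```java
// Hypothetical sketch: none of these names exist in the YARN codebase.
final class DestroyTimeoutPolicy {
    // Assumed default of 5 minutes, the value suggested in the discussion.
    static final long DEFAULT_TIMEOUT_MS = 5 * 60 * 1000L;

    // Returns the AM's override when it is a positive value, else the default.
    static long effectiveTimeoutMs(Long amOverrideMs) {
        if (amOverrideMs == null || amOverrideMs <= 0) {
            return DEFAULT_TIMEOUT_MS;
        }
        return amOverrideMs;
    }

    // True once an initialized container has sat idle for at least its timeout.
    static boolean shouldKill(long initializedAtMs, long nowMs, Long amOverrideMs) {
        return nowMs - initializedAtMs >= effectiveTimeoutMs(amOverrideMs);
    }
}
```

With this shape, an AM that never sets a timeout gets the default, and the NM only needs a single periodic check per idle container.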

bq. Can you add a state machine transition diagram to explain how new states and events affect
each other?
Will do. I was thinking maybe we add another Container State, such as *AWAITING_START*, to
explicitly distinguish it from *LOCALIZED* as I had suggested in the initial doc. Shall update
the doc and put it up.
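
Until the diagram is up, one possible shape of the transitions could look like the sketch below. This is purely illustrative: apart from the *AWAITING_START* name, the states, events and transitions here are assumptions, not the NM's actual container state machine and not what the updated doc will necessarily say. In particular, it assumes that a stop returns the container to *AWAITING_START* (keeping the allocation) and that the timeout fires only in that state.

```java
// Hypothetical sketch of a decoupled lifecycle; not the real NM state machine.
final class ContainerLifecycleSketch {
    enum State { NEW, LOCALIZING, AWAITING_START, RUNNING, DONE }
    enum Event { INITIALIZE, LOCALIZED, START, STOP, TIMEOUT, DESTROY }

    // Returns the next state, or throws if the event is not valid in this state.
    static State next(State s, Event e) {
        switch (s) {
            case NEW:
                if (e == Event.INITIALIZE) return State.LOCALIZING;
                break;
            case LOCALIZING:
                if (e == Event.LOCALIZED) return State.AWAITING_START;
                if (e == Event.DESTROY) return State.DONE;
                break;
            case AWAITING_START:
                if (e == Event.START) return State.RUNNING;
                // Assumed: an idle container is cleaned up on timeout or explicit destroy.
                if (e == Event.TIMEOUT || e == Event.DESTROY) return State.DONE;
                break;
            case RUNNING:
                // Assumed: stop keeps the allocation and waits for another start.
                if (e == Event.STOP) return State.AWAITING_START;
                if (e == Event.DESTROY) return State.DONE;
                break;
            default:
                break;
        }
        throw new IllegalStateException("Invalid event " + e + " in state " + s);
    }
}
```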

bq. I think we should add an explicit re-initialize/re-localize API. For a running process,
ideally, we want to localize the upgraded bits while the container is running and then kill
the existing process to minimize the downtime.
Yup, agreed. We had thought about that, but felt that introducing concurrent localization
while running might introduce more states (like you identified - "running-localizing.." etc).
Also, was thinking about what happens when a concurrent localization completes:
* Should it move to the AWAITING state that waits for a startContainer command from the AM
(which would increase start-up latency) or should it just start automatically?
* What happens when a concurrent re-localization attempt fails? Should the container continue
running or be killed (and the RM notified)? If it continues to run, we need to notify the AM
about the failure (or wait for the AM to call getStatus etc.)

In any case, the interactions between AM and the NM/Container would become non-trivial.
I was thinking we should probably do a sequential stop + initialize/localize + start as a
first cut, and tackle concurrent re-initialization in subsequent JIRAs. Furthermore, I was
planning on tackling this in a more principled manner in YARN-4597.
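
As a rough illustration of the sequential first cut, the flow could be sketched as below. `ContainerOps` is a hypothetical stand-in for whatever calls the new API ends up exposing; its method names and the use of a plain `String` for the launch context are assumptions made only to keep the example self-contained.

```java
// Hypothetical sketch: ContainerOps is NOT the real ContainerManagementProtocol.
final class SequentialReinitSketch {
    // Stand-in for the NM-facing calls discussed in this JIRA.
    interface ContainerOps {
        void stop(String containerId);
        void initialize(String containerId, String launchContext);
        void start(String containerId);
    }

    // Sequential re-initialization: stop, then initialize/localize, then start.
    // The container is down for the whole localization; that is the trade-off
    // accepted in exchange for not adding concurrent "running-localizing" states.
    static void reinitialize(ContainerOps ops, String containerId, String newLaunchContext) {
        ops.stop(containerId);
        ops.initialize(containerId, newLaunchContext);
        ops.start(containerId);
    }
}
```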

bq. Just a clarification, when you mentioned CONTAINER_RESOURCE_CLEANUP , I'm assuming you
meant CLEANUP_CONTAINER_RESOURCES
Yup

bq. Instead of forcing AMs to make two calls, why don't we just add a restart API that does
everything you've outlined above? It's cleaner and we don't have to do as many condition checks.
Totally agree!! But I was thinking we first get the base initialize/destroy and start/stop APIs
well defined and working as expected. Clubbing them into composite commands can be handled in
a subsequent JIRA, since in any case we do have to handle all these cases when an AM calls
initialize/start while the container is running. We could just choose to ignore all commands
except a *restart*, *stop* or *destroy*, but I'd prefer to handle restart as a composite
command.
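
A minimal sketch of the two options mentioned here, with made-up names (`CompositeCommandSketch`, `Command`, `acceptedWhileRunning`, `expand`) that are not part of any YARN API. Expanding *restart* into exactly stop followed by start is an assumption; a real restart might also re-initialize in between.

```java
// Hypothetical sketch: command names mirror the discussion, not a real API.
final class CompositeCommandSketch {
    enum Command { INITIALIZE, START, STOP, DESTROY, RESTART }

    // Option A: while the container is running, ignore everything except these.
    static boolean acceptedWhileRunning(Command c) {
        return c == Command.RESTART || c == Command.STOP || c == Command.DESTROY;
    }

    // Option B: treat restart as a composite of base commands (assumed: stop, then start).
    static Command[] expand(Command c) {
        if (c == Command.RESTART) {
            return new Command[] { Command.STOP, Command.START };
        }
        return new Command[] { c };
    }
}
```

Handling restart through `expand` means the base stop and start paths carry all the condition checks, and the composite adds no new ones.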


> [Phase 1] Decoupled Init / Destroy of Containers from Start / Stop
> ------------------------------------------------------------------
>
>                 Key: YARN-4876
>                 URL: https://issues.apache.org/jira/browse/YARN-4876
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Arun Suresh
>            Assignee: Arun Suresh
>         Attachments: YARN-4876-design-doc.pdf
>
>
> Introduce *initialize* and *destroy* container API into the *ContainerManagementProtocol*
> and decouple the actual start of a container from the initialization. This will allow AMs
> to re-start a container without having to lose the allocation.
> Additionally, if the localization of the container is associated with the initialize (and
> the cleanup with the destroy), this can also be used by applications to upgrade a Container
> by *re-initializing* with a new *ContainerLaunchContext*.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
