hadoop-yarn-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4470) Application Master in-place upgrade
Date Sat, 19 Dec 2015 13:19:46 GMT

    [ https://issues.apache.org/jira/browse/YARN-4470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15065369#comment-15065369 ]

Steve Loughran commented on YARN-4470:

In SLIDER-787 we've already implemented AM upgrade. Specifically, we just have the AM commit
suicide and rely on AM restart to bring itself back up, getting the list of containers back
and then rebuilding our state. We also rely on the RM to update the HDFS and other tokens
as well as the AM/RM token.

As the NMs download the resources again, we pick up the new binaries. What we can't do currently
is (a) change AM resource requirements or (b) avoid that AM restart being mistaken for a failure.
YARN-3417 proposes a specific exit code there.

Accordingly, I'm not convinced we need to do anything here other than treat a specific AM
failure exit code/reported exit as "restart is not a failure".
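
To make that concrete, here is a minimal sketch of the accounting I have in mind, not RM code. The exit code value is a placeholder (YARN-3417 would define the real one), and `count_failures` / `may_restart` are hypothetical names made up for illustration.

```python
# Hypothetical exit code an AM would use to say "deliberate restart, not a crash".
# The real value, if any, is whatever YARN-3417 settles on; 64 is only a placeholder.
UPGRADE_RESTART_EXIT_CODE = 64


def count_failures(attempt_exit_codes, restart_code=UPGRADE_RESTART_EXIT_CODE):
    """Count the AM attempts that should be charged against the retry limit.

    Clean exits (0) and deliberate restarts (restart_code) are not failures.
    """
    return sum(1 for code in attempt_exit_codes if code not in (0, restart_code))


def may_restart(attempt_exit_codes, max_attempts):
    """Return True while the application is still under its failure limit."""
    return count_failures(attempt_exit_codes) < max_attempts
```

With this rule an AM could restart itself for upgrades any number of times without eating into its attempt budget, while real crashes are still counted.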

It does require the AM to initiate the upgrade, but it needs to do this for container upgrades
anyway. Without the AM doing that part of the process, you'd end up with the AM at, say, v1.3
and the containers at 1.2. The AM needs to think about version mismatch in AM/container communications,
and how to upgrade the containers by selective restart.
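
As a rough illustration of "selective restart", and not what Slider actually does, the AM could track the version each container is running and restart the stale ones in small batches. The function names and the idea of a per-container version string are assumptions for this sketch.

```python
def containers_to_upgrade(container_versions, target_version):
    """Return the IDs of containers not yet running target_version, sorted.

    container_versions maps container ID -> version string (hypothetical bookkeeping).
    """
    return sorted(cid for cid, version in container_versions.items()
                  if version != target_version)


def rolling_batches(container_ids, batch_size):
    """Split container IDs into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [container_ids[i:i + batch_size]
            for i in range(0, len(container_ids), batch_size)]
```

For example, an AM at 1.3 with containers `{"c1": "1.2", "c2": "1.3", "c3": "1.2"}` would restart `c1` and `c3` and leave `c2` alone. If the AM itself restarts mid-way, it can recompute the list from whatever versions the surviving containers report.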

The clients don't need to worry about handoff across versions provided they don't cache URLs/IPC
connections, but they need to recover those for AM failover anyway. Same for containers, which
need to cope with the AM coming up somewhere else. We use the YARN-913 registry binding for

The main enhancements of this proposal there are (a) side-by-side startup & handoff and
(b) rollback. Rollback isn't necessarily something that an app can easily do: what happens
if the upgrade AM fails in "that short time period" after changing some state in HDFS, ZK,
the containers, etc: you may be able to rollback the binaries, but the persistent state can
have changed.

W.r.t. side-by-side, again, there's that time window. In Slider we build up our internal state
on a restart based on the containers we get in AM registration, updating it as queued container
failure events start coming in. We actually have to synchronize the AM rebuild process so
that container callbacks don't come until that state has been rebuilt. If the AM came up alongside
the existing one, it'd get confused pretty fast in the presence of container failures during
this handoff period. Either it'd be told of them (state current, new container requests triggered)
or not told of them (state inconsistent). You'd have to do a lot of work
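
The synchronization of rebuild and callbacks described above can be sketched roughly as follows. This is a simplified illustration and not Slider's code: `AppState` and its methods are hypothetical, and a real AM would receive these events through the YARN client callbacks rather than direct calls.

```python
import threading


class AppState:
    """Holds back container-failure callbacks until state has been rebuilt."""

    def __init__(self):
        self._rebuilt = threading.Event()
        self._lock = threading.Lock()
        self._live = set()
        self.replacements_requested = []

    def rebuild(self, containers_from_registration):
        """Rebuild the live set from the containers returned at AM registration."""
        with self._lock:
            self._live = set(containers_from_registration)
        self._rebuilt.set()  # release any callbacks queued behind the rebuild

    def on_container_failed(self, container_id, timeout=None):
        """Handle a failure event, blocking until the rebuild has completed."""
        if not self._rebuilt.wait(timeout):
            raise TimeoutError("state was not rebuilt in time")
        with self._lock:
            if container_id in self._live:
                self._live.remove(container_id)
                self.replacements_requested.append(container_id)

    def live_containers(self):
        with self._lock:
            return set(self._live)
```

A failure event that arrives early simply waits, then gets applied to a consistent view. With two AMs running side by side there is no single rebuild point to wait on, which is where the confusion would come from.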

To summarise: even if this feature existed I don't think we'd move Slider to it; all we'd
like is the YARN-3417 exit code, the ability to restart in the same container (==no queuing
delay) and the ability to request expanded AM resources. I could imagine actually separating
the two: request a resize in the AM container, then, once granted, triggering the restart.
Otherwise, we've got the complexity in the code for AM upgrades, with the hard part actually
dealing with AM restart midway through rolling container upgrade, and rollback of container

Before trying to implement this feature, I'd suggest having a go at implementing rolling upgrades
in an existing app and seeing what's missing.

> Application Master in-place upgrade
> -----------------------------------
>                 Key: YARN-4470
>                 URL: https://issues.apache.org/jira/browse/YARN-4470
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: resourcemanager
>            Reporter: Giovanni Matteo Fumarola
>            Assignee: Giovanni Matteo Fumarola
>         Attachments: AM in-place upgrade design. rev1.pdf
> It would be nice if clients could ask for an AM in-place upgrade.
> It would give YARN the ability to upgrade the AM without losing the work
> done within its containers. This would allow deploying bug fixes and new versions of the AM
> without incurring long service downtimes.

This message was sent by Atlassian JIRA
