aurora-dev mailing list archives

From 黄 凯 <>
Subject Discussion on review request 51536
Date Fri, 02 Sep 2016 01:40:41 GMT
Hi Folks,

I'm currently working on a feature for the Aurora scheduler and executor. The implementation strategy
became controversial on the review board, so I'd like to bring it to a wider
audience and start a discussion. Please feel free to share your thoughts; your help
is greatly appreciated!

The high level goal of this feature is to improve reliability and performance of the Aurora
scheduler job updater, by relying on health check status rather than watch_secs timeout when
deciding an individual instance update state.

Please see the original review request
aurora JIRA ticket
design doc
for more details and background.

Note: The "scheduler change summary" part of the design doc is now slightly outdated
(this is what the review request is trying to address). I've therefore left some comments to
clarify the latest proposed implementation plan for the scheduler change.

There are two questions I'm trying to address here:
1. How does the scheduler infer the executor version and be backward compatible?
2. Where do we determine if health check is enabled?

In short, there are 3 different solutions proposed on the review board.

In the first two approaches, the scheduler relies on a string to determine the executor
version. Whether health check is enabled is determined solely on the executor side, and the
executor communicates the result to the scheduler.
Solution 1:
vCurrent executor sends a message in its health check thread during RUNNING state transition,
and the vCurrent updater will infer the executor version from the presence of this message,
and skip the watch_secs if necessary.
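
To make Solution 1 concrete, here is a minimal sketch of the updater-side check. The marker text and the function name are assumptions for illustration only; the thread does not specify the actual message, and this is not the scheduler's real code.

```python
from typing import Optional

# Hypothetical marker text; the real message is not specified in this thread.
HEALTH_CHECK_MARKER = "health-check-passed"


def should_skip_watch_secs(running_status_message: Optional[str]) -> bool:
    """Return True when the RUNNING status update carries the marker.

    A vPrev executor sends no such message, so the updater keeps waiting
    for watch_secs, which is what makes the scheme backward compatible.
    """
    return running_status_message == HEALTH_CHECK_MARKER
```

The exact-match comparison also shows why this approach is considered brittle: any change to the message text silently disables the feature.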

Solution 2:
Instead of relying on the presence of an arbitrary string in the message, rely on the presence
of a string like: "capabilities:CAPABILITY_1,CAPABILITY_2" where CAPABILITY_1 and CAPABILITY_2
(etc.) are constants defined in api.thrift. Basically just formalizing the mechanism and making
it a bit more future proof.
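
As a rough illustration of Solution 2, the capability string could be parsed along these lines. The prefix constant and function name are hypothetical, and the capability names are the placeholders from the example string above rather than constants that exist in api.thrift today.

```python
from typing import FrozenSet, Optional

# Hypothetical prefix, taken from the example string in this message.
CAPABILITIES_PREFIX = "capabilities:"


def parse_capabilities(message: Optional[str]) -> FrozenSet[str]:
    """Extract capability names from a status message.

    Returns an empty set for a missing message or one without the prefix,
    which is how a vPrev executor would appear to the scheduler.
    """
    if not message or not message.startswith(CAPABILITIES_PREFIX):
        return frozenset()
    body = message[len(CAPABILITIES_PREFIX):]
    # Ignore empty entries such as a trailing comma.
    return frozenset(name.strip() for name in body.split(",") if name.strip())
```

The scheduler would then test for membership of a specific capability instead of matching an arbitrary string, so new executor features only need a new constant.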

In the third solution, the scheduler infers the executor version from the JobUpdateSettings
on scheduler side.
Solution 3:
Add a bit to JobUpdateSettings called 'executorDrivenUpdates'. If it is set, the
scheduler assumes that the transition from STARTING -> RUNNING means the executor is healthy.
Concurrently, we release thermos and change HealthCheckConfig to say that the task should only
go to RUNNING after it is healthy.

Pros and Cons:
The main benefit of Solution 1 is:
1. By using the message in task status update, we don't have to make any schema change, which
makes the design simple.
2. The feature is fully backward-compatible. When we roll out the vCurrent schedulers and
executors, we do not have to instruct users to provide an additional field in the Job or
Update configs, which could confuse them while the vPrev and vCurrent executors coexist
in the cluster.

The drawbacks of Solution 1 are that relying on the presence of a message makes things
brittle, and that we do not want to expose this message to users.

The benefit of Solution 2 is that it makes the feature more future proof. However, if we do not
envision a new executor feature in the short term, it is not much different from Solution 1.

The benefits of Solution 3 include:
1. We support more than just thermos now (and others rely on custom executors).
2. A lot of things in Aurora treat the executor as opaque. The status update message sent
by the executor should be visible to users only if it's an error message.

The drawbacks of Solution 3 are:
1. In addition to the 'executorDrivenUpdates' bit that identifies the executor version,
we still need to notify the scheduler whether health check is enabled on the vCurrent executor.
If it is not, the scheduler must be able to fall back to watch_secs.
2. Users have to provide an additional field in their .aurora config files. The feature
wouldn't be available unless new clients are rolled out as well.
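
The fallback requirement for Solution 3 can be sketched as follows. This is only an illustration under stated assumptions: UpdateSettings and seconds_to_watch_after_running are hypothetical names, executor_driven_updates stands in for the proposed 'executorDrivenUpdates' bit, and how the scheduler would learn health_check_enabled is exactly the open question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UpdateSettings:
    """Hypothetical stand-in for the relevant JobUpdateSettings fields."""
    watch_secs: int
    executor_driven_updates: bool = False  # proposed bit, not an existing field


def seconds_to_watch_after_running(settings: UpdateSettings,
                                   health_check_enabled: bool) -> int:
    """Return how long the updater should still watch an instance after RUNNING.

    With the bit set and health checks enabled, RUNNING already implies
    healthy, so no extra wait is needed. Otherwise fall back to watch_secs.
    """
    if settings.executor_driven_updates and health_check_enabled:
        return 0
    return settings.watch_secs
```

For example, a job with watch_secs=45 and the bit set would still wait 45 seconds if health checks are disabled, which is the fallback described in the first drawback.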

Please let me know if I understand your suggestions correctly and hopefully everyone is on
the same page!

