aurora-dev mailing list archives

From David McLaughlin <>
Subject Re: Few things we would like to support in aurora scheduler
Date Fri, 17 Jun 2016 02:26:40 GMT
On Thu, Jun 16, 2016 at 1:28 PM, Igor Morozov <> wrote:

> Hi aurora people,
> I would like to start a discussion around a few things we would like to see
> supported in aurora scheduler. It is based on our experience of integrating
> aurora into Uber infrastructure and I believe all the items I'm going to
> talk about will benefit the community and people running aurora clusters.
> 1. We support multiple aurora clusters in different failure domains and we
> run services in those domains. The upgrade workflow for those services
> includes rolling out the same version of a service software to all aurora
> clusters concurrently while monitoring the health status and other service
> vitals, such as error logs, service stats, and downstream/upstream
> service health. That means we occasionally need to
> manually trigger a rollback if things go south and roll back all the update
> jobs in all aurora clusters for that particular service. So here are the
> problems we discovered so far with this approach:
>        - We don't have an easy way to assign a common unique identifier for
> all JobUpdates in different aurora clusters in order to reconcile them
> later into a single meta update job so to speak. Instead we need to
> generate that ID and keep it in every aurora's JobUpdate
> metadata (JobUpdateRequest.taskConfig). Then, in order to get the status of
> the upgrade workflow running in different data centers, we have to query all
> recent jobs and, based on their metadata content, try to pick out the ones
> that we think belong to the currently running upgrade for the service.
> We propose to change
> struct JobUpdateRequest {
>   /** Desired TaskConfig to apply. */
>   1: TaskConfig taskConfig
>   /** Desired number of instances of the task config. */
>   2: i32 instanceCount
>   /** Update settings and limits. */
>   3: JobUpdateSettings settings
>   /** Optional job update key's id; if not specified, Aurora will
> generate one. */
>   4: optional string id
> }
> There is potentially another, much more involved solution of supporting
> user-defined metadata, mentioned in this ticket:

I actually think the linked ticket is less involved? It has no impact on
logic, etc. So the work involved is just updating the Thrift object and
then writing in the metadata to the storage layer. But I'm fine with either
(or both!) approaches.
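For illustration, the workaround Igor describes could be sketched like this in Python. All names here are hypothetical, and plain dicts stand in for the JobUpdateSummary objects a real client would page through per cluster; the point is just the shared id stamped into each cluster's update metadata and used to reconcile later:

```python
import uuid


def make_cross_cluster_update_id():
    """One id shared by the JobUpdates started in every cluster."""
    return str(uuid.uuid4())


def find_updates_for_rollout(summaries, update_id, meta_key="rollout_id"):
    """Pick out the updates that belong to a given cross-cluster rollout.

    `summaries` is a list of dicts shaped like
    {'key': ..., 'metadata': {...}} -- stand-ins for the update
    summaries fetched from each Aurora cluster.
    """
    return [s for s in summaries
            if s.get("metadata", {}).get(meta_key) == update_id]
```

With the proposed optional `id` field, the generate-and-stamp step goes away: the same id is simply passed in every JobUpdateRequest.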

>     -  All that brings us to a second problem we had to deal with during
> the upgrade:
> We don't have a good way to manually trigger a job update rollback in
> aurora. The use case is again the same, while running multiple update jobs
> in different aurora clusters we have a real production requirement to start
> rolling back update jobs if things are misbehaving and the nature of this
> misbehavior could potentially be very complex. Currently we abort the job
> update and start a new one that essentially rolls the cluster forward to a
> previously run version of the software.
> We propose a new convenience API to roll back a running or complete
> JobUpdate:
>   /** Rollback job update. */
>   Response rollbackJobUpdate(
>       /** The update to rollback. */
>       1: JobUpdateKey key,
>       /** A user-specified message to include with the induced job update
> state change. */
>       3: string message)

+1 to the idea, but there is ambiguity in what rollback means when you pass
a JobUpdateKey.


Two possible interpretations:

*undoJobUpdate* (it would replay the previousState instructions from the
given job update)
*rollbackToJobUpdate* (you'd pass the JobUpdateKey, and it would replay the
instructions from that job update)
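The difference between the two readings can be sketched with plain data. Names and shapes here are hypothetical: each history entry is a dict standing in for a stored update's instructions, with the config it replaced and the config it rolled out:

```python
def undo_job_update(history, key):
    """Replay the *previous* state recorded by the given update:
    the cluster goes back to whatever ran before that update."""
    return history[key]["old_config"]


def rollback_to_job_update(history, key):
    """Replay the *instructions* of the given update:
    the cluster converges on the config that update rolled out."""
    return history[key]["new_config"]
```

Passing the same JobUpdateKey to each yields different target configs, which is exactly the ambiguity a single `rollbackJobUpdate(key)` would leave open.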

> 2. The next problem is related to the way we collect service cluster
> status. I couldn't find a way to quickly get the latest statuses for all
> instances/shards of a job in one query. Instead we query all task statuses
> for a job, then manually iterate through all the statuses and keep the
> latest one per instance id. For services with lots of churn in task
> statuses, that means huge blobs of Thrift transferred every time we
> issue a query. I was thinking of adding something along these lines:
> struct TaskQuery {
>   // TODO(maxim): Remove in 0.7.0. (AURORA-749)
>   8: Identity owner
>   14: string role
>   9: string environment
>   2: string jobName
>   4: set<string> taskIds
>   5: set<ScheduleStatus> statuses
>   7: set<i32> instanceIds
>   10: set<string> slaveHosts
>   11: set<JobKey> jobKeys
>   12: i32 offset
>   13: i32 limit
>   15: i32 limit_per_instance
> }
> but I'm less certain on API here so any help would be welcome.
> All the changes we propose would be backward compatible.
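The client-side workaround described above (the part a `limit_per_instance` filter would make unnecessary) can be sketched like this, with dicts standing in for the ScheduledTask objects a status query returns; field names here are assumptions for illustration:

```python
def latest_status_per_instance(tasks):
    """Collapse a full task-status history into one entry per instance,
    keeping the most recent status event for each instance id."""
    latest = {}
    for task in tasks:
        seen = latest.get(task["instance_id"])
        if seen is None or task["timestamp_ms"] > seen["timestamp_ms"]:
            latest[task["instance_id"]] = task
    return latest
```

Note the whole history still crosses the wire before this filter runs; a server-side limit would cut the payload to one status per instance.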

> --
> -Igor
