aurora-dev mailing list archives

From "Erb, Stephan" <Stephan....@blue-yonder.com>
Subject Re: Few things we would like to support in aurora scheduler
Date Sun, 19 Jun 2016 15:24:18 GMT

>> The next problem is related to the way we collect service cluster
>> status. I couldn't find a way to quickly get the latest statuses for all
>> instances/shards of a job in one query. Instead we query all task statuses
>> for a job, then manually iterate through all the statuses and keep only the
>> latest one per instance id. For services with lots of churn on task
>> statuses, that means huge blobs of thrift transferred every time we
>> issue a query. I was thinking of adding something along these lines:
>
>
>Does a TaskQuery filtering by job key and ACTIVE_STATES solve this?  Still
>includes the TaskConfig, but it's a single query, and probably rarely
>exceeds 1 MB in response payload.
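
For reference, the read path in question here is the getTasksStatus RPC in
api.thrift, which takes the same TaskQuery struct quoted further down in this
thread; its signature is roughly (doc comment abridged, so check api.thrift
for the authoritative version):

  /** Fetches the status of tasks that match the given query. */
  Response getTasksStatus(1: TaskQuery query)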

We have a related problem where we are interested in the status of the last executed cron
job. Unfortunately, ACTIVE_STATES don't help here. One potential solution I have thought
about is a flag in TaskQuery that enables server-side sorting of tasks by their latest event
time. We could then query the status of the latest run by using this flag in combination with
limit=1. It could also be combined with the proposed limit_per_instance flag to cover the use
case quoted above.
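
A rough sketch of what such a flag could look like in TaskQuery (field name
and id are placeholders only, not a finished design):

  struct TaskQuery {
    // ... existing fields, see the full struct quoted further down ...

    /** If set, sort matching tasks by the timestamp of their latest task
        event, newest first, before offset/limit are applied. */
    16: optional bool sortByLatestEventTime
  }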



On Thu, Jun 16, 2016 at 1:28 PM, Igor Morozov <igmorv@gmail.com> wrote:

> Hi aurora people,
>
> I would like to start a discussion around a few things we would like to see
> supported in the aurora scheduler. It is based on our experience of
> integrating aurora into Uber's infrastructure, and I believe all the items
> I'm going to talk about will benefit the community and people running
> aurora clusters.
>
> 1. We support multiple aurora clusters in different failure domains and we
> run services in those domains. The upgrade workflow for those services
> includes rolling out the same version of a service's software to all aurora
> clusters concurrently while monitoring the health status and other service
> vitals, such as error logs, service stats, and the health of
> downstream/upstream services. That means we occasionally need to manually
> trigger a rollback if things go south and roll back all the update jobs in
> all aurora clusters for that particular service. So here are the problems
> we discovered so far with this approach:
>
>        - We don't have an easy way to assign a common unique identifier for
> all JobUpdates in different aurora clusters in order to reconcile them
> later into a single meta update job, so to speak. Instead we need to
> generate that ID ourselves and keep it in every aurora cluster's JobUpdate
> metadata (JobUpdateRequest.taskConfig). Then, in order to get the status of
> the upgrade workflow running in different data centers, we have to query all
> recent jobs and, based on their metadata content, try to pick out the ones
> that we think belong to the currently running upgrade for the service.
>
> We propose to change JobUpdateRequest as follows:
>
> struct JobUpdateRequest {
>   /** Desired TaskConfig to apply. */
>   1: TaskConfig taskConfig
>
>   /** Desired number of instances of the task config. */
>   2: i32 instanceCount
>
>   /** Update settings and limits. */
>   3: JobUpdateSettings settings
>
>   /** Optional JobUpdate key's id; if not specified, aurora will generate
>       one. */
>   4: optional string id
> }
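
For context, such an id would simply ride along on the existing startJobUpdate
RPC that already consumes the JobUpdateRequest; that call currently looks
along these lines in api.thrift (doc comments abridged, check the actual file):

  /** Starts an update of an existing service job. */
  Response startJobUpdate(
      /** A description of how to change the job. */
      1: JobUpdateRequest request,
      /** A user-specified message to include with the induced job update
          state change. */
      3: string message)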
>
> There is potentially another, much more involved solution of supporting
> user-defined metadata, as mentioned in this ticket:
> https://issues.apache.org/jira/browse/AURORA-1711
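
To make that alternative a bit more concrete: one purely illustrative shape
would be to reuse the key/value Metadata struct that api.thrift already
defines for task metadata and attach a set of it to the update request (the
field id below is a placeholder; the actual design is whatever AURORA-1711
settles on):

  struct JobUpdateRequest {
    // ... fields as above ...

    /** Illustrative only; see AURORA-1711 for the real discussion. */
    5: optional set<Metadata> metadata
  }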
>
>
>     -  All that brings us to a second problem we had to deal with during
> the upgrade:
> We don't have a good way to manually trigger a job update rollback in
> aurora. The use case is again the same: while running multiple update jobs
> in different aurora clusters, we have a real production requirement to start
> rolling back update jobs if things are misbehaving, and the nature of this
> misbehavior can potentially be very complex. Currently we abort the job
> update and start a new one that essentially rolls the cluster forward to a
> previously run version of the software.
>
> We propose a new convenience API to roll back a running or completed
> JobUpdate:
>
>   /** Rollback job update. */
>   Response rollbackJobUpdate(
>       /** The update to rollback. */
>       1: JobUpdateKey key,
>       /** A user-specified message to include with the induced job update
>           state change. */
>       3: string message)
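
For comparison, this mirrors the shape of the existing update-control calls
(pauseJobUpdate, resumeJobUpdate, abortJobUpdate). abortJobUpdate, which the
workaround described above relies on, is declared along these lines in
api.thrift:

  /** Aborts an in-progress job update. */
  Response abortJobUpdate(
      /** The update to abort. */
      1: JobUpdateKey key,
      /** A user-specified message to include with the induced job update
          state change. */
      3: string message)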
>
> 2. The next problem is related to the way we collect service cluster
> status. I couldn't find a way to quickly get the latest statuses for all
> instances/shards of a job in one query. Instead we query all task statuses
> for a job, then manually iterate through all the statuses and keep only the
> latest one per instance id. For services with lots of churn on task
> statuses, that means huge blobs of thrift transferred every time we
> issue a query. I was thinking of adding something along these lines:
> struct TaskQuery {
>   // TODO(maxim): Remove in 0.7.0. (AURORA-749)
>   8: Identity owner
>   14: string role
>   9: string environment
>   2: string jobName
>   4: set<string> taskIds
>   5: set<ScheduleStatus> statuses
>   7: set<i32> instanceIds
>   10: set<string> slaveHosts
>   11: set<JobKey> jobKeys
>   12: i32 offset
>   13: i32 limit
>   15: i32 limit_per_instance
> }
>
> but I'm less certain about the API here, so any help would be welcome.
>
> All the changes we propose would be backward compatible.
>
> --
> -Igor
>

