aurora-dev mailing list archives

From Igor Morozov <igm...@gmail.com>
Subject Re: Few things we would like to support in aurora scheduler
Date Tue, 21 Jun 2016 06:42:20 GMT
I created two tickets to track the discussion there:

https://issues.apache.org/jira/browse/AURORA-1721
https://issues.apache.org/jira/browse/AURORA-1722

I'm willing to work on rollback and potentially (depending on a result of
the discussion) on adding TaskQuery flag.

Thanks,
-Igor

On Sun, Jun 19, 2016 at 8:24 AM, Erb, Stephan <Stephan.Erb@blue-yonder.com>
wrote:

>
> >> The next problem is related to the way we collect service cluster
> >> status. I couldn't find a way to quickly get the latest statuses for all
> >> instances/shards of a job in one query. Instead we query all task
> >> statuses for a job, then manually iterate through all the statuses and
> >> keep the latest one per instance id. For services with lots of churn on
> >> task statuses that means huge blobs of thrift transferred every time we
> >> issue a query. I was thinking of adding something along these lines:
> >
> >
> >Does a TaskQuery filtering by job key and ACTIVE_STATES solve this?  Still
> >includes the TaskConfig, but it's a single query, and probably rarely
> >exceeds 1 MB in response payload.
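> >
> >A rough sketch of that single query, assuming the generated Python Thrift
> >bindings from api.thrift (module paths may differ in your build) and a
> >hypothetical make_client() helper that wraps transport setup:
> >
> >from gen.apache.aurora.api.ttypes import TaskQuery, JobKey
> >from gen.apache.aurora.api.constants import ACTIVE_STATES
> >
> >client = make_client()  # hypothetical transport/protocol setup
> >
> ># One round trip: every task currently in an active state for the job.
> >query = TaskQuery(
> >    jobKeys={JobKey(role='www-data', environment='prod', name='hello')},
> >    statuses=ACTIVE_STATES)
> >resp = client.getTasksStatus(query)
> >tasks = resp.result.scheduleStatusResult.tasks  # list of ScheduledTask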
>
> We have a related problem, where we are interested in the status of the
> last executed cron job. Unfortunately, ACTIVE_STATES doesn't help here. One
> potential solution I have thought about is a flag in TaskQuery that enables
> server-side sorting of tasks by their latest event time. We could then
> query the status of the latest run by using this flag in combination with
> limit=1. It could also be combined with the limit_per_instance flag to
> cover the use case mentioned here.
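>
> A client-side approximation of what I mean, to illustrate the desired
> semantics (a sketch against the generated Python bindings; make_client()
> is a placeholder for transport setup):
>
> from gen.apache.aurora.api.ttypes import TaskQuery, JobKey
>
> client = make_client()  # placeholder
>
> # Fetch all runs of the cron job, then emulate "sort by latest event
> # time, limit=1" on the client side.
> resp = client.getTasksStatus(
>     TaskQuery(jobKeys={JobKey(role='www-data', environment='prod',
>                               name='nightly-report')}))
> tasks = resp.result.scheduleStatusResult.tasks
>
> def latest_event_ts(task):
>     return max(e.timestamp for e in task.taskEvents)
>
> last_run = max(tasks, key=latest_event_ts) if tasks else None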
>
>
>
> On Thu, Jun 16, 2016 at 1:28 PM, Igor Morozov <igmorv@gmail.com> wrote:
>
> > Hi aurora people,
> >
> > I would like to start a discussion around a few things we would like to
> > see supported in the aurora scheduler. It is based on our experience of
> > integrating aurora into Uber's infrastructure, and I believe all the items
> > I'm going to talk about will benefit the community and people running
> > aurora clusters.
> >
> > 1. We support multiple aurora clusters in different failure domains and
> > we run services in those domains. The upgrade workflow for those services
> > includes rolling out the same version of a service's software to all
> > aurora clusters concurrently while monitoring health status and other
> > service vitals, such as error logs, service stats, and downstream/upstream
> > service health. That means we occasionally need to manually trigger a
> > rollback if things go south and roll back all the update jobs in all
> > aurora clusters for that particular service. So here are the problems we
> > have discovered so far with this approach:
> >
> >        - We don't have an easy way to assign a common unique identifier
> > to all JobUpdates in different aurora clusters in order to reconcile them
> > later into a single meta update job, so to speak. Instead we need to
> > generate that ID and keep it in every JobUpdate's metadata
> > (JobUpdateRequest.taskConfig). Then, in order to get the status of the
> > upgrade workflow running in different data centers, we have to query all
> > recent job updates and, based on their metadata content, try to filter in
> > the ones that we think belong to the currently running upgrade for the
> > service.
> >
> > We propose to change
> > struct JobUpdateRequest {
> >   /** Desired TaskConfig to apply. */
> >   1: TaskConfig taskConfig
> >
> >   /** Desired number of instances of the task config. */
> >   2: i32 instanceCount
> >
> >   /** Update settings and limits. */
> >   3: JobUpdateSettings settings
> >
> >   /** Optional Job Update key's id; if not specified aurora will generate
> > one. */
> >   4: optional string id
> > }
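> >
> > With that field, the upgrade workflow could mint one id and reuse it in
> > every cluster. A sketch of the caller side (Python against the generated
> > bindings; the id field is the proposed addition, and make_client() and
> > the cluster names are placeholders):
> >
> > import uuid
> > from gen.apache.aurora.api.ttypes import JobUpdateRequest
> >
> > shared_id = str(uuid.uuid4())  # one id for the whole cross-cluster upgrade
> >
> > for cluster in ('dc1', 'dc2'):  # placeholder cluster names
> >     client = make_client(cluster)  # placeholder per-cluster client
> >     request = JobUpdateRequest(
> >         taskConfig=new_task_config,   # built elsewhere
> >         instanceCount=20,
> >         settings=update_settings,     # built elsewhere
> >         id=shared_id)                 # proposed field, not in today's API
> >     client.startJobUpdate(request, 'rolling out build 1234 to all clusters')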
> >
> > There is potentially another, much more involved, solution of supporting
> > user-defined metadata, mentioned in this ticket:
> > https://issues.apache.org/jira/browse/AURORA-1711
> >
> >
> >     - All that brings us to a second problem we had to deal with during
> > the upgrade: we don't have a good way to manually trigger a job update
> > rollback in aurora. The use case is again the same: while running multiple
> > update jobs in different aurora clusters, we have a real production
> > requirement to start rolling back update jobs if things are misbehaving,
> > and the nature of this misbehavior can be very complex. Currently we abort
> > the job update and start a new one that essentially rolls the cluster
> > forward to a previously run version of the software.
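> >
> > Roughly, today's workaround looks like this (a sketch against the
> > existing abortJobUpdate/startJobUpdate RPCs; make_client() and the
> > configs are placeholders):
> >
> > from gen.apache.aurora.api.ttypes import JobUpdateRequest
> >
> > client = make_client()  # placeholder
> >
> > # Step 1: stop the misbehaving update where it is.
> > client.abortJobUpdate(bad_update_key, 'error rate spiked, aborting')
> >
> > # Step 2: "roll back" by rolling forward to the last good release.
> > client.startJobUpdate(
> >     JobUpdateRequest(taskConfig=previous_task_config,  # last good config
> >                      instanceCount=20,
> >                      settings=update_settings),
> >     'rolling forward to previous release')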
> >
> > We propose a new convenience API to roll back a running or completed
> > JobUpdate:
> >
> >   /** Rollback job update. */
> >   Response rollbackJobUpdate(
> >       /** The update to rollback. */
> >       1: JobUpdateKey key,
> >       /** A user-specified message to include with the induced job update
> > state change. */
> >       3: string message)
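> >
> > Usage would then collapse to a single call per cluster, e.g. (a sketch;
> > rollbackJobUpdate is the proposed RPC and make_client() is a placeholder):
> >
> > from gen.apache.aurora.api.ttypes import JobKey, JobUpdateKey
> >
> > client = make_client()  # placeholder
> >
> > update_key = JobUpdateKey(
> >     job=JobKey(role='www-data', environment='prod', name='hello'),
> >     id=shared_id)  # id of the update to roll back
> > client.rollbackJobUpdate(update_key, 'error rate spiked, reverting')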
> >
> > 2. The next problem is related to the way we collect service cluster
> > status. I couldn't find a way to quickly get the latest statuses for all
> > instances/shards of a job in one query. Instead we query all task statuses
> > for a job, then manually iterate through all the statuses and keep the
> > latest one per instance id. For services with lots of churn on task
> > statuses that means huge blobs of thrift transferred every time we issue a
> > query. I was thinking of adding something along these lines:
> > struct TaskQuery {
> >   // TODO(maxim): Remove in 0.7.0. (AURORA-749)
> >   8: Identity owner
> >   14: string role
> >   9: string environment
> >   2: string jobName
> >   4: set<string> taskIds
> >   5: set<ScheduleStatus> statuses
> >   7: set<i32> instanceIds
> >   10: set<string> slaveHosts
> >   11: set<JobKey> jobKeys
> >   12: i32 offset
> >   13: i32 limit
> >   15: i32 limit_per_instance
> > }
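> >
> > The status collector could then issue something like the following (a
> > sketch; limit_per_instance is the proposed field and make_client() is a
> > placeholder):
> >
> > from gen.apache.aurora.api.ttypes import TaskQuery, JobKey
> >
> > client = make_client()  # placeholder
> >
> > # One query, only the newest task per instance instead of the full history.
> > query = TaskQuery(
> >     jobKeys={JobKey(role='www-data', environment='prod', name='hello')},
> >     limit_per_instance=1)  # proposed field, not in today's API
> > resp = client.getTasksStatus(query)
> > latest = {t.assignedTask.instanceId: t.status
> >           for t in resp.result.scheduleStatusResult.tasks}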
> >
> > but I'm less certain about the API here, so any help would be welcome.
> >
> > All the changes we propose would be backward compatible.
> >
> > --
> > -Igor
> >
>
>
>


-- 
-Igor
