aurora-issues mailing list archives

From "Igor Morozov (JIRA)" <>
Subject [jira] [Commented] (AURORA-1749) Get a support for distributed job update coordination
Date Fri, 12 Aug 2016 21:34:20 GMT


Igor Morozov commented on AURORA-1749:

For example, the coordinator needs to understand whether to run a rollback at some point for
a job in the ROLLED_FORWARD state.
That means we would need to keep some kind of metadata in the job update configuration being
scheduled, indicating:
1. That JobUpdate is in fact a rollback
2. The JobUpdate has some kind of identifier that helps us to relate it to job updates in
other datacenters
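
A minimal sketch of what that metadata could look like. Everything here is hypothetical: UpdateMetadata, is_rollback, coordination_id and related_updates are illustrative names, not existing Aurora types or fields.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class UpdateMetadata:
    """Hypothetical metadata carried by a job update (not an Aurora type)."""
    update_id: str
    datacenter: str
    is_rollback: bool               # point 1: this update is in fact a rollback
    coordination_id: Optional[str]  # point 2: identifier shared across datacenters


def related_updates(updates: List[UpdateMetadata],
                    coordination_id: str) -> List[UpdateMetadata]:
    """Return the updates, from any datacenter, that share one coordination id."""
    return [u for u in updates if u.coordination_id == coordination_id]
```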

That brings all kinds of corner cases we need to consider. What if JobUpdate(rollback) was
aborted? What if it was aborted multiple times: how deep do we need to query the history of
updates? What if another job update went in right after the one we need to roll back?

So ideally a distributed job update that is still running (or paused in some kind of
waiting-for-completion state) would work as a distributed lock for updates.
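
A rough sketch of those lock semantics, assuming a hypothetical in-memory DistributedUpdateLock keyed by job. It only illustrates the intended behavior (one coordinated update per job at a time) and is not a real distributed lock or anything Aurora provides.

```python
from typing import Dict


class DistributedUpdateLock:
    """Hypothetical: tracks which coordinated update currently owns each job."""

    def __init__(self) -> None:
        self._holder: Dict[str, str] = {}  # job key -> coordination id

    def try_acquire(self, job_key: str, coordination_id: str) -> bool:
        """Succeed if the job is free or already held by the same update."""
        holder = self._holder.get(job_key)
        if holder is not None and holder != coordination_id:
            return False  # another distributed update is running or paused
        self._holder[job_key] = coordination_id
        return True

    def release(self, job_key: str, coordination_id: str) -> None:
        """Release only if this update is the current holder."""
        if self._holder.get(job_key) == coordination_id:
            del self._holder[job_key]
```

With this shape, a second update for test-service is rejected until the first one reaches its final state in every datacenter and releases the lock.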

> Get a support for distributed job update coordination
> -----------------------------------------------------
>                 Key: AURORA-1749
>                 URL:
>             Project: Aurora
>          Issue Type: Story
>          Components: Scheduler
>            Reporter: Igor Morozov
>            Assignee: Igor Morozov
> This is for a use case to update jobs that are the same but spread across multiple datacenters
> and managed by different aurora clusters.
> For example, we have a service job test-service that runs in two datacenters, dc1 and dc2.
> Logically the job needs to be updated in a single lock step across multiple data centers,
> and if any job update fails and goes into ROLLING_BACK state,
> all the others need to start a rollback as well.
> This is what we want to achieve with this change:
> 1. Coordinator starts an upgrade:
>     dc1: -> starting update1 for job1
>     dc2: -> starting update2 for job2
> 2. Coordinator:
>     dc1: update1 is done, enters paused state
>     dc2: update2 has failed, rolling back
> 3. Coordinator:
>     dc1: starts rolling back update 1
>     dc2: update 2 is rolled back
> 4. Coordinator:
>     dc1: update 1 is rolled back
>     dc2: update 2 is rolled back
> Currently step 2 is impossible in aurora, as the job update enters a terminal state and
> cannot be rolled back from it.
> There was some discussion in the AURORA-1721 ticket about using another job update to
> roll the job forward to a previous version, effectively simulating a rollback. But then, in order
> to reconcile the state of an actual update operation, one would need to consider two or more
> job updates and differentiate between normal ROLLED_FORWARD and ROLLED_FORWARD(rollback) updates.
> That feels quite artificial and error prone. We believe the ability to run a coordinated job update
> across multiple data centers should be a first-class citizen in aurora.
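
The coordinator steps in the description can be sketched as a small decision function. This is only an illustration under stated assumptions: plan_actions, the action names, and the PAUSED state are hypothetical, and only ROLLING_BACK and ROLLED_FORWARD are state names taken from the description itself.

```python
from typing import Dict

# States treated as "this datacenter's update has failed" (ROLLED_BACK is assumed).
FAILING = {"ROLLING_BACK", "ROLLED_BACK"}


def plan_actions(states: Dict[str, str]) -> Dict[str, str]:
    """Map each datacenter's update state to the coordinator's next action."""
    if any(s in FAILING for s in states.values()):
        # One failure means every update that is not already rolling back must start.
        return {dc: ("none" if s in FAILING else "rollback")
                for dc, s in states.items()}
    if all(s == "PAUSED" for s in states.values()):
        # Every datacenter finished its update and is waiting: complete them all.
        return {dc: "complete" for dc in states}
    # Otherwise some updates are still in progress.
    return {dc: "wait" for dc in states}
```

For step 2 of the example, plan_actions({"dc1": "PAUSED", "dc2": "ROLLING_BACK"}) asks dc1 to roll back and leaves dc2 alone.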
