aurora-issues mailing list archives

From "David McLaughlin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AURORA-1749) Get a support for distributed job update coordination
Date Fri, 12 Aug 2016 22:50:20 GMT

    [ https://issues.apache.org/jira/browse/AURORA-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15419641#comment-15419641 ]

David McLaughlin commented on AURORA-1749:
------------------------------------------

Not clear to me why you would need (1).  

I filed https://issues.apache.org/jira/browse/AURORA-1711 (and we will submit a patch for
this soon) that can support (2). You could also use the message field to contain such data.






> Get a support for distributed job update coordination
> -----------------------------------------------------
>
>                 Key: AURORA-1749
>                 URL: https://issues.apache.org/jira/browse/AURORA-1749
>             Project: Aurora
>          Issue Type: Story
>          Components: Scheduler
>            Reporter: Igor Morozov
>            Assignee: Igor Morozov
>
> This use case covers updating jobs that are identical but spread across multiple datacenters
> and managed by different Aurora clusters.
> For example, we have a service job test-service that runs in two datacenters, dc1 and dc2.
> Logically, the job needs to be updated in lockstep across the datacenters,
> and if any job update fails and goes into the ROLLING_BACK state,
> all the others need to start a rollback as well.
>  
> This is what we want to achieve with this change:
> 1. Coordinator starts an upgrade:
>     dc1: -> starting update1 for job1
>     dc2: -> starting update2 for job2
> 2. Coordinator:
>     dc1: update1 is done, enters paused state
>     dc2: update2 has failed, rolling back
> 3. Coordinator:
>     dc1: starts rolling back update 1
>     dc2: update 2 is rolled back
> 4. Coordinator:
>     dc1: update 1 is rolled back
>     dc2: update 2 is rolled back
> Currently step 2 is impossible in Aurora, as the job update enters a terminal state and
> cannot be rolled back from it.
> There was some discussion in the AURORA-1721 ticket about using another job update to
> roll the job forward to a previous version, effectively simulating a rollback. But then, in order
> to reconcile the state of an actual update operation, one would need to consider two or more
> job updates and differentiate between normal ROLLED_FORWARD and ROLLED_FORWARD(rollback) updates.
> That feels quite artificial and error prone. We believe the ability to run a coordinated job update
> across multiple data centers should be a first-class citizen in Aurora.
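
The coordinator steps in the description above can be sketched roughly as follows. This is only an illustration of the decision logic, not Aurora code: the function name plan_actions, the action strings, and the four-value Status enum are assumptions made for the example, and the sketch does not call any scheduler API. It also leaves out the success path (what the coordinator does once every cluster is paused), since the description does not spell that out.

```python
from enum import Enum


class Status(Enum):
    # Illustrative subset of update states; names mirror those used in the ticket.
    ROLLING_FORWARD = "ROLLING_FORWARD"
    ROLL_FORWARD_PAUSED = "ROLL_FORWARD_PAUSED"
    ROLLING_BACK = "ROLLING_BACK"
    ROLLED_BACK = "ROLLED_BACK"


# States that mean a cluster's update has failed and is (or was) rolling back.
ROLLBACK_STATES = {Status.ROLLING_BACK, Status.ROLLED_BACK}


def plan_actions(states):
    """Hypothetical helper: decide one coordinator pass.

    `states` maps a cluster name (e.g. "dc1") to the Status of its update.
    Returns a dict mapping each cluster name to "rollback", "wait" or "none".
    """
    any_rollback = any(s in ROLLBACK_STATES for s in states.values())
    actions = {}
    for cluster, status in states.items():
        if status in ROLLBACK_STATES:
            # Already rolling back or rolled back; nothing to request.
            actions[cluster] = "none"
        elif any_rollback:
            # Another cluster failed, so this one must roll back too (step 3).
            actions[cluster] = "rollback"
        else:
            # No failure seen yet; keep waiting for the other clusters.
            actions[cluster] = "wait"
    return actions


if __name__ == "__main__":
    # Step 2 of the description: dc1 finished and paused, dc2 failed.
    print(plan_actions({
        "dc1": Status.ROLL_FORWARD_PAUSED,
        "dc2": Status.ROLLING_BACK,
    }))
```

The "rollback" action for a cluster whose update has already finished is exactly the part that needs the update to stop in a paused, non-terminal state, which is what this ticket asks for.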



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
