aurora-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mehrdad Nurolahzade <mehr...@apache.org>
Subject Re: Review Request 60714: aurora job restart request should be idempotent and retryable
Date Sun, 16 Jul 2017 22:41:49 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60714/#review180646
-----------------------------------------------------------


Ship it!




Ship It!

- Mehrdad Nurolahzade


On July 15, 2017, 11 p.m., Kai Huang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/60714/
> -----------------------------------------------------------
> 
> (Updated July 15, 2017, 11 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Mehrdad Nurolahzade, Santhosh Kumar Shanmugham,
and Zameer Manji.
> 
> 
> Bugs: AURORA-1940
>     https://issues.apache.org/jira/browse/AURORA-1940
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> aurora job restart request should be idempotent and retryable.
> 
> There was a recent change to the Aurora client to provide "at most once" instead of "at
least once" retries for non-idempotent operations. See: https://github.com/apache/aurora/commit/f1e25375def5a047da97d8bdfb47a3a9101568f6
> 
> Technically, `aurora job restart` is a non-idempotent operation, thus it was not retried.
However, when a transport exception occurs, the operator has to babysit simple operations
like aurora job restart if it were not retried. Compared to the requests that were causing
problems (admin tasks, job creating, updates, etc.), restarts in general should be retried
rather than erring on the side of caution.
> 
> Job restart can be divided into three steps:
> - 1. get instance status (getTasksWithoutConfigs)
> - 2. restart shards (restartShards)
> - 3. watch instance until healthy (getTasksWithoutConfigs)
> 
> TTransport exception can be thrown at each of these step, ideally we should make __ALL__
of the steps above __idempotent__ and retryable. 
> 
> The only trickey part is that the `watch` logic is also used in --wait-until options
of job create/add command, making this step retryable will have an impact on job create/add
commands as well.
> 
> In this CR, I will make the first __TWO__ steps retryable since they are self-contained
in job restart command.
> If people are OK with this strategy, I'll make the `watch` step retryable as well.
> 
> __Updates__:
> I made the `watch` step in `aurora job restart` retryable in the 3rd revision. Note the
`InstanceWatcher` in `Restarter` relies on `StatusHelper.get_tasks` method to query the latest
instance status. This method is also invoked by JobMonitor during job create/add. 
> 
> The solution here is to pass a `retry` flag to this method, so that it can be customized
for different scenarios.
> 
> __Open Question__:
> Currently, `aurora job create --wait-until RUNNING` fails immediately if there is any
transport exception during the `wait` step. Do we need retry for the `wait` step?
> 
> There are three phases where transport exception could be thrown:
> - Phase I: job create request was not sent to scheduler yet, we should retry the command.
> - Phase II: job create request was already sent to scheduler, we should not retry the
command, and exit.
> - Phase III: job create command gets to the `wait` step, that means the job create request
must have already been sent to scheduler, we should not have to retry the command.
> 
> So overall, we do not retry for the `wait` step. The same principle is applicable to
`job add` command.
> 
> 
> Diffs
> -----
> 
>   src/main/python/apache/aurora/client/api/instance_watcher.py a35fb22edc9a5bb12f286856b170f5cb0170eceb

>   src/main/python/apache/aurora/client/api/restarter.py 6600c6b608ee70a02e11ca8550a06b7fb76fd863

>   src/main/python/apache/aurora/client/api/task_util.py fb7c76f42171737701c740fddf893f07211a47a0

>   src/test/python/apache/aurora/api_util.py bd6e3a6abb5a2eec621121fc1b4d547c54a9d19d

>   src/test/python/apache/aurora/client/api/test_instance_watcher.py 8fd419f812c6e038e6f7fba0a662da0a6410b794

>   src/test/python/apache/aurora/client/api/test_job_monitor.py 537abd38cd13bcbb615c0224c983868b29027379

>   src/test/python/apache/aurora/client/api/test_restarter.py a81003e79e090bbfa151431366f5394f405d99eb

>   src/test/python/apache/aurora/client/api/test_task_util.py 365ef59857003402c4354e06abeaea947d49322b

>   src/test/python/apache/aurora/client/cli/test_command_hooks.py a44a25fb1e4357780adca4403b6010d160066b21

>   src/test/python/apache/aurora/client/cli/test_create.py 3b09bb25e919bac2795ccd56bd98657b1f98690b

>   src/test/python/apache/aurora/client/cli/test_plugins.py 762735e00bd8051ab64674e7ee359758d3cbeca7

>   src/test/python/apache/aurora/client/cli/test_restart.py cb4adc55caec354d2cdf6a92ff2dd8a3b4a9eca8

> 
> 
> Diff: https://reviews.apache.org/r/60714/diff/3/
> 
> 
> Testing
> -------
> 
> ./build-support/jenkins/build.sh
>  
>  To test the retry logic, TTranport exceptions were randomly thrown at the client side(for
all api calls to the scheduler proxy, there is a 50% chance of throwing TTranport exception),
aurora job restart command was issued against scheduler, and the output was like:
> ```
>  vagrant@aurora:~$ aurora job restart devcluster/vagrant/test/hello
>  WARN] Transport error communicating with scheduler: Timed out talking to http://aurora.local:8081/api,
retrying...
>  WARN] Transport error communicating with scheduler: Timed out talking to http://aurora.local:8081/api,
retrying...
>  INFO] Performing rolling restart of job devcluster/vagrant/test/hello (instances: [0])
>  INFO] Restarting instances: [0]
>  WARN] Transport error communicating with scheduler: Timed out talking to http://aurora.local:8081/api,
retrying...
>  WARN] Transport error communicating with scheduler: Timed out talking to http://aurora.local:8081/api,
retrying...
>  INFO] Watching instances: [0]
>  WARN] Transport error communicating with scheduler: Timed out talking to http://aurora.local:8081/api,
retrying...
>  INFO] Detected RUNNING instance 0
>  WARN] Transport error communicating with scheduler: Timed out talking to http://aurora.local:8081/api,
retrying...
>  WARN] Transport error communicating with scheduler: Timed out talking to http://aurora.local:8081/api,
retrying...
>  WARN] Transport error communicating with scheduler: Timed out talking to http://aurora.local:8081/api,
retrying...
>  INFO] Instance 0 has been up and healthy for at least 30 seconds
>  INFO] All instances were restarted successfully
>  Job devcluster/vagrant/test/hello restarted successfully
> ```
> 
> __Updates__:
> 
> To ensure that job create/add/kill commands work as expected, TTranport exceptions were
randomly thrown at the client side, aurora job create/add/kill commands were issued against
scheduler:
> 
> __JOB CREATE__:
> Job create command succeeds when no transport exception was thrown. 
> ```
> vagrant@aurora:~$ aurora job create devcluster/vagrant/test/hello /vagrant/hello.aurora
>  INFO] Creating job hello
>  INFO] Checking status of devcluster/vagrant/test/hello
> Job create succeeded: job url=http://aurora.local:8081/scheduler/vagrant/test/hello
> ```
> 
> Job create command __retries__(getTasksWithoutConfigs) when transport exception were
thrown __BEFORE__ the job create request was sent to scheduler.
> ```
> vagrant@aurora:~$ aurora job create devcluster/vagrant/test/hello /vagrant/hello.aurora
>  INFO] Creating job hello
>  INFO] Checking status of devcluster/vagrant/test/hello
>  WARN] Transport error communicating with scheduler: Timed out talking to http://aurora.local:8081/api,
retrying...
> Job create succeeded: job url=http://aurora.local:8081/scheduler/vagrant/test/hello
> ```
> 
> Job create command __does NOT retry__ when transport exception were thrown __AFTER__
the job create request was sent to scheduler.
> ```
> vagrant@aurora:~$ aurora job create devcluster/vagrant/test/hello /vagrant/hello.aurora
>  INFO] Creating job hello
> Transport error communicating with scheduler during non-idempotent operation: Timed out
talking to http://aurora.local:8081/api, not retrying
> ```
> 
> __JOB ADD__:
> Job add command succeeds when no transport exception was thrown. 
> ```
> vagrant@aurora:~$ aurora job add devcluster/vagrant/test/hello/0 1
>  INFO] Adding 1 instances to devcluster/vagrant/test/hello using the task config of instance
0
> ```
> 
> Job add command __retries__(getTasksWithoutConfigs) when transport exception were thrown
__BEFORE__ the job add request was sent to scheduler.
> ```
> vagrant@aurora:~$ aurora job add devcluster/vagrant/test/hello/0 1
>  WARN] Transport error communicating with scheduler: Timed out talking to http://aurora.local:8081/api,
retrying...
>  INFO] Adding 1 instances to devcluster/vagrant/test/hello using the task config of instance
0
> ```
> 
> Job add command __does NOT retry__ when transport exception were thrown __AFTER__ the
job add request was sent to scheduler.
> ```
> vagrant@aurora:~$ aurora job add devcluster/vagrant/test/hello/0 1
>  INFO] Adding 1 instances to devcluster/vagrant/test/hello using the task config of instance
0
> Transport error communicating with scheduler during non-idempotent operation: Timed out
talking to http://aurora.local:8081/api, not retrying
> ```
> 
> __JOB KILL__:
> Job kill command succeeds when no transport exception was thrown. 
> ```
> vagrant@aurora:~$ aurora job killall devcluster/vagrant/test/hello
>  INFO] Killing tasks for job: devcluster/vagrant/test/hello
>  INFO] Instances to be killed: [0]
> Successfully killed instances [0]
> Job killall succeeded
> ```
> 
> Job kill command __retries__(getTasksWithoutConfigs) when transport exception were thrown
__BEFORE__ the job kill request was sent to scheduler.
> ```
> vagrant@aurora:~$ aurora job killall devcluster/vagrant/test/hello
>  WARN] Transport error communicating with scheduler: Timed out talking to http://aurora.local:8081/api,
retrying...
>  INFO] Killing tasks for job: devcluster/vagrant/test/hello
>  INFO] Instances to be killed: [0]
> Successfully killed instances [0]
> Job killall succeeded
> ```
> 
> Job kill command __does NOT retry__ when transport exception were thrown __AFTER__ the
job kill request was sent to scheduler.
> ```
> vagrant@aurora:~$ aurora job killall devcluster/vagrant/test/hello
>  INFO] Killing tasks for job: devcluster/vagrant/test/hello
>  INFO] Instances to be killed: [0]
> Transport error communicating with scheduler during non-idempotent operation: Timed out
talking to http://aurora.local:8081/api, not retrying
> ```
> 
> 
> Thanks,
> 
> Kai Huang
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message