aurora-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Aurora ReviewBot <wfar...@apache.org>
Subject Re: Review Request 60714: aurora job restart request should be idempotent and retryable
Date Fri, 07 Jul 2017 20:35:59 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60714/#review179935
-----------------------------------------------------------


Ship it!




Master (a922b05) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On July 7, 2017, 8:20 p.m., Kai Huang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/60714/
> -----------------------------------------------------------
> 
> (Updated July 7, 2017, 8:20 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Mehrdad Nurolahzade, Santhosh Kumar Shanmugham,
and Zameer Manji.
> 
> 
> Bugs: AURORA-1940
>     https://issues.apache.org/jira/browse/AURORA-1940
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> aurora job restart request should be idempotent and retryable.
> 
> There was a recent change to the Aurora client to provide "at most once" instead of "at
least once" retries for non-idempotent operations. See: https://github.com/apache/aurora/commit/f1e25375def5a047da97d8bdfb47a3a9101568f6
> 
> Technically, `aurora job restart` is a non-idempotent operation, thus it was not retried.
However, when a transport exception occurs, the operator has to babysit simple operations
like aurora job restart if it were not retried. Compared to the requests that were causing
problems (admin tasks, job creating, updates, etc.), restarts in general should be retried
rather than erring on the side of caution.
> 
> Job restart can be divided into three steps:
> - 1. get instance status (getTasksWithoutConfigs)
> - 2. restart shards (restartShards)
> - 3. watch instance until healthy (getTasksWithoutConfigs)
> 
> TTransport exception can be thrown at each of these step, ideally we should make __ALL__
of the steps above __idempotent__ and retryable. 
> 
> However, given that the `watch` logic is also used in --wait options of job create/kill
command), making this step retryable will have an impact job create/kill commands as well.
IMO, it's probably OK for us to retry the watch during job create/kill since the watch step
is readonly.
> 
> In this CR, I will make the first __TWO__ steps retryable since they are self-contained
in job restart command.
> If people are OK with this strategy, I'll make the `watch` step retryable as well.
> 
> 
> Diffs
> -----
> 
>   src/main/python/apache/aurora/client/api/restarter.py 6600c6b608ee70a02e11ca8550a06b7fb76fd863

>   src/test/python/apache/aurora/api_util.py bd6e3a6abb5a2eec621121fc1b4d547c54a9d19d

>   src/test/python/apache/aurora/client/api/test_restarter.py a81003e79e090bbfa151431366f5394f405d99eb

>   src/test/python/apache/aurora/client/cli/test_restart.py cb4adc55caec354d2cdf6a92ff2dd8a3b4a9eca8

> 
> 
> Diff: https://reviews.apache.org/r/60714/diff/2/
> 
> 
> Testing
> -------
> 
> ./build-support/jenkins/build.sh
>  
>  To test the retry logic, random TTranport exceptions were thrown at the client side(for
all api calls to the scheduler proxy), aurora job restart command was issued against scheduler,
and the output was like:
> ```
>  vagrant@aurora:~$ aurora job restart devcluster/vagrant/test/hello
>  WARN] Transport error communicating with scheduler: Timed out talking to http://aurora.local:8081/api,
retrying...
>  WARN] Transport error communicating with scheduler: Timed out talking to http://aurora.local:8081/api,
retrying...
>  INFO] Performing rolling restart of job devcluster/vagrant/test/hello (instances: [0])
>  INFO] Restarting instances: [0]
>  WARN] Transport error communicating with scheduler: Timed out talking to http://aurora.local:8081/api,
retrying...
>  WARN] Transport error communicating with scheduler: Timed out talking to http://aurora.local:8081/api,
retrying...
>  INFO] Watching instances: [0]
>  WARN] Transport error communicating with scheduler: Timed out talking to http://aurora.local:8081/api,
retrying...
>  INFO] Detected RUNNING instance 0
>  WARN] Transport error communicating with scheduler: Timed out talking to http://aurora.local:8081/api,
retrying...
>  WARN] Transport error communicating with scheduler: Timed out talking to http://aurora.local:8081/api,
retrying...
>  WARN] Transport error communicating with scheduler: Timed out talking to http://aurora.local:8081/api,
retrying...
>  INFO] Instance 0 has been up and healthy for at least 30 seconds
>  INFO] All instances were restarted successfully
>  Job devcluster/vagrant/test/hello restarted successfully
> ```
> 
> 
> Thanks,
> 
> Kai Huang
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message