aurora-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Kai Huang <texasred2...@hotmail.com>
Subject Re: Review Request 60714: aurora job restart request should be idempotent and retryable
Date Fri, 07 Jul 2017 20:20:05 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60714/
-----------------------------------------------------------

(Updated July 7, 2017, 8:20 p.m.)


Review request for Aurora, David McLaughlin, Mehrdad Nurolahzade, Santhosh Kumar Shanmugham,
and Zameer Manji.


Changes
-------

Removed unused property in restarter.


Bugs: AURORA-1940
    https://issues.apache.org/jira/browse/AURORA-1940


Repository: aurora


Description
-------

aurora job restart request should be idempotent and retryable.

There was a recent change to the Aurora client to provide "at most once" instead of "at least
once" retries for non-idempotent operations. See: https://github.com/apache/aurora/commit/f1e25375def5a047da97d8bdfb47a3a9101568f6

Technically, `aurora job restart` is a non-idempotent operation, thus it was not retried.
However, when a transport exception occurs, the operator has to babysit simple operations
like aurora job restart if it were not retried. Compared to the requests that were causing
problems (admin tasks, job creating, updates, etc.), restarts in general should be retried
rather than erring on the side of caution.

Job restart can be divided into three steps:
- 1. get instance status (getTasksWithoutConfigs)
- 2. restart shards (restartShards)
- 3. watch instance until healthy (getTasksWithoutConfigs)

TTransport exception can be thrown at each of these step, ideally we should make __ALL__ of
the steps above __idempotent__ and retryable. 

However, given that the `watch` logic is also used in --wait options of job create/kill command),
making this step retryable will have an impact job create/kill commands as well. IMO, it's
probably OK for us to retry the watch during job create/kill since the watch step is readonly.

In this CR, I will make the first __TWO__ steps retryable since they are self-contained in
job restart command.
If people are OK with this strategy, I'll make the `watch` step retryable as well.


Diffs (updated)
-----

  src/main/python/apache/aurora/client/api/restarter.py 6600c6b608ee70a02e11ca8550a06b7fb76fd863

  src/test/python/apache/aurora/api_util.py bd6e3a6abb5a2eec621121fc1b4d547c54a9d19d 
  src/test/python/apache/aurora/client/api/test_restarter.py a81003e79e090bbfa151431366f5394f405d99eb

  src/test/python/apache/aurora/client/cli/test_restart.py cb4adc55caec354d2cdf6a92ff2dd8a3b4a9eca8



Diff: https://reviews.apache.org/r/60714/diff/2/

Changes: https://reviews.apache.org/r/60714/diff/1-2/


Testing
-------

./build-support/jenkins/build.sh
 
 To test the retry logic, random TTranport exceptions were thrown at the client side(for all
api calls to the scheduler proxy), aurora job restart command was issued against scheduler,
and the output was like:
```
 vagrant@aurora:~$ aurora job restart devcluster/vagrant/test/hello
 WARN] Transport error communicating with scheduler: Timed out talking to http://aurora.local:8081/api,
retrying...
 WARN] Transport error communicating with scheduler: Timed out talking to http://aurora.local:8081/api,
retrying...
 INFO] Performing rolling restart of job devcluster/vagrant/test/hello (instances: [0])
 INFO] Restarting instances: [0]
 WARN] Transport error communicating with scheduler: Timed out talking to http://aurora.local:8081/api,
retrying...
 WARN] Transport error communicating with scheduler: Timed out talking to http://aurora.local:8081/api,
retrying...
 INFO] Watching instances: [0]
 WARN] Transport error communicating with scheduler: Timed out talking to http://aurora.local:8081/api,
retrying...
 INFO] Detected RUNNING instance 0
 WARN] Transport error communicating with scheduler: Timed out talking to http://aurora.local:8081/api,
retrying...
 WARN] Transport error communicating with scheduler: Timed out talking to http://aurora.local:8081/api,
retrying...
 WARN] Transport error communicating with scheduler: Timed out talking to http://aurora.local:8081/api,
retrying...
 INFO] Instance 0 has been up and healthy for at least 30 seconds
 INFO] All instances were restarted successfully
 Job devcluster/vagrant/test/hello restarted successfully
```


Thanks,

Kai Huang


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message