aurora-reviews mailing list archives

From Mehrdad Nurolahzade <mehr...@apache.org>
Subject Re: Review Request 58768: AURORA-1924 Aurora client should reconcile idempotent job creations
Date Wed, 03 May 2017 15:46:21 GMT


> On April 27, 2017, 12:06 p.m., Mehrdad Nurolahzade wrote:
> > This does not work as intended in presence of multiple clients.
> > 
> > Example timeline:
> > 
> > Client 1|Client 2|Request|Scheduler|Response
> > --------|--------|-------|---------|--------
> > Create J|        |OK     |Created  |FAIL      // Delivering response failed, client 1 will retry after 5 seconds
> >         |Kill J  |OK     |Killed   |OK        // Client 2 successfully killed J
> > Create J|        |OK     |Created  |OK        // Client 1 will conclude that it has successfully created J while the global state has been compromised.

Reflecting on the behavior change introduced by this patch, I am no longer concerned. Here
is the justification.

In the multi-client world of Aurora, where clients can concurrently access the scheduler and submit
requests over unreliable communication channels, one of the following four situations can
happen when it comes to job creation:

1. **One request**: job create request is received, processed, and response is delivered to
client. Request is successful if key does not exist, and is failed otherwise (`ResponseCode.INVALID_REQUEST`
with no `JobCreateResult`).
2. **One request, retried**: job create request/response is not received/delivered, client
retries request after 5 seconds. If it was received the first time it is softly rejected this
time (`ResponseCode.INVALID_REQUEST` with a `JobCreateResult`). If it was not received the
first time, it is processed this time and job is either created or request fails (case 1 above).
3. **Two requests, read-only operation in between**: job create request/response is not received/delivered,
client one retries request after 5 seconds, scheduler handles a read-only operation from client
two associated with the same job in between the two requests from client one. The concern
here is that client two might make a decision based on a state of the job that might
no longer be valid after the retry from client one. But this also happens today. Aurora does
not provide atomic CAS operation support, therefore there is no guarantee that scheduler state
does not change in between a read and the follow-up mutable operation.
4. **Two requests, mutable operation in between**: job create request/response is not received/delivered,
client one retries request after 5 seconds, scheduler handles a mutable operation from client
two associated with the same job in between the two requests from client one (the scenario
depicted in my previous comment). The concern here is that client two might make a decision
based on the modification it just made to the state of the job that might no longer be valid
after the retry from client one. Again, this is the same behavior that exists today. Aurora
does not support multi-operation transactions, therefore, scheduler state can change in between
a mutable operation and follow-up read-only or mutable operations.
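
To make the four cases concrete, here is a minimal sketch of the behavior described above. It is a toy model, not the patch itself: `ToyScheduler`, `client_create`, the dict-based job store, and the field on `JobCreateResult` are all hypothetical names chosen for illustration. Only `ResponseCode.INVALID_REQUEST` and the presence or absence of a `JobCreateResult` come from the discussion. Whether an `OK` response also carries a result is not stated here, so the sketch leaves it empty.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class ResponseCode(Enum):
    OK = "OK"
    INVALID_REQUEST = "INVALID_REQUEST"


@dataclass(frozen=True)
class JobCreateResult:
    # Hypothetical payload; the real field layout is not given in this thread.
    job_key: str


@dataclass
class Response:
    code: ResponseCode
    result: Optional[JobCreateResult] = None


@dataclass
class ToyScheduler:
    """In-memory stand-in for the scheduler (illustration only)."""
    jobs: Dict[str, dict] = field(default_factory=dict)

    def create_job(self, key: str, config: dict) -> Response:
        existing = self.jobs.get(key)
        if existing is None:
            # Key does not exist: create the job.
            self.jobs[key] = dict(config)
            return Response(ResponseCode.OK)
        if existing == config:
            # Same key, identical configuration: soft rejection,
            # signalled by attaching a JobCreateResult.
            return Response(ResponseCode.INVALID_REQUEST, JobCreateResult(key))
        # Same key, different configuration: hard rejection, no result.
        return Response(ResponseCode.INVALID_REQUEST)

    def kill_job(self, key: str) -> Response:
        self.jobs.pop(key, None)
        return Response(ResponseCode.OK)


def client_create(scheduler: ToyScheduler, key: str, config: dict) -> str:
    """Client-side interpretation of a create response."""
    resp = scheduler.create_job(key, config)
    if resp.code is ResponseCode.OK:
        return "created"
    if resp.code is ResponseCode.INVALID_REQUEST and resp.result is not None:
        # Idempotent re-creation: warn the user, but do not fail.
        return "exists"
    return "failed"
```

In this model, a retry of an identical create returns `"exists"` (case 2), while a retry that arrives after another client has killed the job returns `"created"` again (case 4). The client cannot tell the two histories apart, which is the same limitation that exists without the patch.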

I believe we can review and accept this patch.


- Mehrdad


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58768/#review173237
-----------------------------------------------------------


On May 2, 2017, 9:33 p.m., Mehrdad Nurolahzade wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58768/
> -----------------------------------------------------------
> 
> (Updated May 2, 2017, 9:33 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and Stephan Erb.
> 
> 
> Bugs: AURORA-1924
>     https://issues.apache.org/jira/browse/AURORA-1924
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Aurora scheduler rejects a request to create a job if a job with the same key already exists. Aurora client exits with an error once it receives a response with `ResponseCode.INVALID_REQUEST` from scheduler in this case.
> 
> However, an attempt to create a job with the exact same configuration and number of instances is essentially idempotent. Scheduler can detect this situation, ignore it, and signal client to treat operation as successful; client warns user about existing job but does not fail the operation.
> 
> 
> Diffs
> -----
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 3749531b5412d7ca217736aa85eed8e6606225ad
>   src/main/java/org/apache/aurora/scheduler/thrift/SchedulerThriftInterface.java 059fbb86a575f5b3d78a63c9a7b5a9eebb6cb3ae
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc
>   src/test/java/org/apache/aurora/scheduler/thrift/SchedulerThriftInterfaceTest.java 016859ca3bf83f64d2576b4c7109729770f9e25c
>   src/test/python/apache/aurora/client/cli/test_create.py 3b09bb25e919bac2795ccd56bd98657b1f98690b
> 
> 
> Diff: https://reviews.apache.org/r/58768/diff/1/
> 
> 
> Testing
> -------
> 
> - Manually under Vagrant
> - End to end test script
> 
> 
> Thanks,
> 
> Mehrdad Nurolahzade
> 
>

