aurora-reviews mailing list archives

From Mehrdad Nurolahzade <mehr...@apache.org>
Subject Re: Review Request 58768: AURORA-1924 Aurora client should reconcile idempotent job creations
Date Wed, 03 May 2017 16:56:25 GMT


> On April 27, 2017, 12:06 p.m., Mehrdad Nurolahzade wrote:
> > This does not work as intended in presence of multiple clients.
> > 
> > Example timeline:
> > 
> > Client 1|Client 2|Request|Scheduler|Response
> > --------|--------|-------|---------|--------
> > Create J|        |OK     |Created  |FAIL      // Delivering response failed, client 1 will retry after 5 seconds
> >         |Kill J  |OK     |Killed   |OK        // Client 2 successfully killed J
> > Create J|        |OK     |Created  |OK        // Client 1 will conclude that it has successfully created J while the global state has been compromised.
> 
> Mehrdad Nurolahzade wrote:
>     Reflecting on the behavior change introduced by this patch, I am no longer concerned. Here is the justification.
>     
>     In the multi-client world of Aurora, where clients can concurrently access the scheduler and submit requests over unreliable communication channels, one of the following four situations can arise when it comes to job creation:
>     
>     1. **One request**: the job create request is received and processed, and the response is delivered to the client. The request succeeds if the key does not exist and fails otherwise (`ResponseCode.INVALID_REQUEST` with no `JobCreateResult`).
>     2. **One request, retried**: the job create request is not received or its response is not delivered, so the client retries the request after 5 seconds. If the request was received the first time, it is softly rejected this time (`ResponseCode.INVALID_REQUEST` with a `JobCreateResult`). If it was not received the first time, it is processed this time and the job is either created or the request fails (case 1 above).
>     3. **Two requests, read-only operation in between**: the job create request is not received or its response is not delivered, client one retries the request after 5 seconds, and the scheduler handles a read-only operation from client two on the same job in between the two requests from client one. The concern here is that client two might make a decision based on a job state that is no longer valid after the retry from client one. But this also happens today. Aurora does not provide atomic CAS operations, so there is no guarantee that scheduler state does not change between a read and the follow-up mutable operation.
>     4. **Two requests, mutable operation in between**: the job create request is not received or its response is not delivered, client one retries the request after 5 seconds, and the scheduler handles a mutable operation from client two on the same job in between the two requests from client one (the scenario depicted in my previous comment). The concern here is that client two might make a decision based on the modification it just made to the job, which might no longer be valid after the retry from client one. Again, this is the same behavior that exists today. Aurora does not support multi-operation transactions, so scheduler state can change between a mutable operation and follow-up read-only or mutable operations.
>     
>     I believe we can review and accept this patch.
> 
> David McLaughlin wrote:
>     Now that we shipped the change to not automatically retry job create, is this still necessary?

It provides a slightly better user experience, but we can live without it.

**Without this patch**: When a job create fails, it is not automatically retried, so the client has to either query the scheduler to verify state or keep resubmitting the command until it either succeeds (job is created) or fails (job already exists).

**With this patch**: When a job create fails, it is automatically retried (assuming that we turn on the `retry=True` argument on the `createJob()` client API), and it ultimately succeeds whether it creates the job or learns that the job was created in one of the previous tries.
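
To make the "with this patch" flow concrete, here is a minimal, self-contained sketch. It is not the code in the diff: `FakeScheduler`, `LossyChannel`, `create_with_retry`, and the `job_create_result` field are hypothetical stand-ins, and the job key is made up. It only assumes what is described in this thread: a soft rejection is `INVALID_REQUEST` carrying a `JobCreateResult`, and it is returned when the existing job has the same configuration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class ResponseCode(Enum):
    OK = "OK"
    INVALID_REQUEST = "INVALID_REQUEST"


@dataclass
class Response:
    code: ResponseCode
    # Hypothetical stand-in for JobCreateResult: set only on a soft rejection,
    # i.e. the job already exists with the same configuration.
    job_create_result: Optional[str] = None


class FakeScheduler:
    """In-memory stand-in for the scheduler (illustration only)."""

    def __init__(self) -> None:
        self.jobs: Dict[str, dict] = {}

    def create_job(self, key: str, config: dict) -> Response:
        existing = self.jobs.get(key)
        if existing is None:
            self.jobs[key] = dict(config)
            return Response(ResponseCode.OK)
        if existing == config:
            # Soft rejection: same key and same configuration.
            return Response(ResponseCode.INVALID_REQUEST, job_create_result=key)
        # Hard rejection: the key exists with a different configuration.
        return Response(ResponseCode.INVALID_REQUEST)


class LossyChannel:
    """Loses the response of the first `drop` calls after the scheduler processed them."""

    def __init__(self, scheduler: FakeScheduler, drop: int) -> None:
        self.scheduler = scheduler
        self.drop = drop

    def create_job(self, key: str, config: dict) -> Response:
        response = self.scheduler.create_job(key, config)
        if self.drop > 0:
            self.drop -= 1
            raise ConnectionError("response lost")
        return response


def create_with_retry(channel: LossyChannel, key: str, config: dict,
                      max_tries: int = 3) -> str:
    for _ in range(max_tries):
        try:
            response = channel.create_job(key, config)
        except ConnectionError:
            # A real client would wait before retrying (5 seconds in this thread).
            continue
        if response.code is ResponseCode.OK:
            return "created"
        if response.job_create_result is not None:
            return "already exists"  # treated as success, with a warning
        return "failed"
    return "unknown"  # every try lost its response
```

With `LossyChannel(scheduler, drop=1)`, the first try creates the job but loses the response, and the retry reports `"already exists"` instead of failing. The multi-client interleavings from cases 3 and 4 above are deliberately not modeled here.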


- Mehrdad


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58768/#review173237
-----------------------------------------------------------


On May 2, 2017, 9:33 p.m., Mehrdad Nurolahzade wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58768/
> -----------------------------------------------------------
> 
> (Updated May 2, 2017, 9:33 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and Stephan Erb.
> 
> 
> Bugs: AURORA-1924
>     https://issues.apache.org/jira/browse/AURORA-1924
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Aurora scheduler rejects a request to create a job if a job with the same key already exists. Aurora client exits with an error once it receives a response with `ResponseCode.INVALID_REQUEST` from scheduler in this case.
> 
> However, an attempt to create a job with the exact same configuration and number of instances is essentially idempotent. Scheduler can detect this situation, ignore it, and signal client to treat operation as successful; client warns user about existing job but does not fail the operation.
> 
> 
> Diffs
> -----
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 3749531b5412d7ca217736aa85eed8e6606225ad
>   src/main/java/org/apache/aurora/scheduler/thrift/SchedulerThriftInterface.java 059fbb86a575f5b3d78a63c9a7b5a9eebb6cb3ae
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc
>   src/test/java/org/apache/aurora/scheduler/thrift/SchedulerThriftInterfaceTest.java 016859ca3bf83f64d2576b4c7109729770f9e25c
>   src/test/python/apache/aurora/client/cli/test_create.py 3b09bb25e919bac2795ccd56bd98657b1f98690b
> 
> 
> Diff: https://reviews.apache.org/r/58768/diff/1/
> 
> 
> Testing
> -------
> 
> - Manually under Vagrant
> - End to end test script
> 
> 
> Thanks,
> 
> Mehrdad Nurolahzade
> 
>

