aurora-reviews mailing list archives

From Zameer Manji <zma...@apache.org>
Subject Re: Review Request 58611: Bump initial_task_kill_retry_interval to 15s.
Date Wed, 26 Apr 2017 01:32:34 GMT


> On April 25, 2017, 6:20 p.m., David McLaughlin wrote:
> > How is the effect of changes like this measured? Seems very hunch-driven, whereas
> > other potential performance reviews were met with requests for methodology, etc.
> 
> David McLaughlin wrote:
>     Also, generally good to have at least two Ship Its per review? Let's make sure we
>     follow that convention.

I don't think this needs to be measured. Just consider the following:
1. Thermos gives a task up to 60s to terminate.
2. Once the process terminates, Thermos sends a `TASK_KILLED` to the agent, which forwards
   it to Aurora.
3. Aurora retries the task kill every 5s, so for a process that needs the full 60s to drain
   it will send up to 12 `TASK_KILL` messages while waiting for the `TASK_KILLED` response.
4. This change reduces the retries to 4 (60s / 15s), which makes far more sense to me.

Our values (60s in Thermos) and (5s in Aurora) don't align at all.
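
To make the arithmetic above concrete, here is a minimal sketch. It assumes a fixed retry interval and a 60s drain window, as described in the points above; the function and constant names are made up for illustration and are not Aurora code. The flag name (`initial_task_kill_retry_interval`) suggests the real interval may grow between retries, which this sketch ignores.

```python
# Hypothetical helper, not part of Aurora: upper bound on the number of
# kill retries sent while a task is still draining, assuming the retry
# interval stays constant.
def kill_retries_during_drain(drain_seconds: int, retry_interval_seconds: int) -> int:
    if retry_interval_seconds <= 0:
        raise ValueError("retry interval must be positive")
    return drain_seconds // retry_interval_seconds


# Assumption taken from the discussion: Thermos gives a task up to 60s.
THERMOS_DRAIN_SECONDS = 60

if __name__ == "__main__":
    for interval in (5, 15):
        retries = kill_retries_during_drain(THERMOS_DRAIN_SECONDS, interval)
        print(f"retry every {interval}s -> up to {retries} retries")
```

With the current 5s interval this gives 12 retries, and with the proposed 15s interval it gives 4.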

Agreed that this needs a better explanation and two Ship Its.


- Zameer


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58611/#review173008
-----------------------------------------------------------


On April 21, 2017, 3:36 a.m., Stephan Erb wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58611/
> -----------------------------------------------------------
> 
> (Updated April 21, 2017, 3:36 a.m.)
> 
> 
> Review request for Aurora and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> It is not very common that kills are dropped by Mesos and have to be retried
> by Aurora. It therefore makes sense to slightly increase the retry timeout
> so that we don't retry needlessly when Thermos is still busy executing
> the lifecycle methods.
> 
> By default, Thermos uses the following kill escalation sequence:
> 
>   * /quitquitquit
>   * wait 5s
>   * /abortabortabort
>   * wait 5s
>   * SIGTERM
>   * wait up to 1 minute
>   * SIGKILL
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/reconciliation/ReconciliationModule.java e076e802f8920b37cef202520c7fbe59724dd06d
> 
> 
> Diff: https://reviews.apache.org/r/58611/diff/1/
> 
> 
> Testing
> -------
> 
> ./gradlew -Pq build
> 
> 
> Thanks,
> 
> Stephan Erb
> 
>

