mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Vinod Kone <vinodk...@gmail.com>
Subject Re: Review Request 46187: Terminate when receiving the ACK of terminal status update.
Date Wed, 14 Sep 2016 03:08:17 GMT


> On Sept. 6, 2016, 11:33 a.m., Vinod Kone wrote:
> > src/launcher/executor.cpp, line 682
> > <https://reviews.apache.org/r/46187/diff/5/?file=1488422#file1488422line682>
> >
> >     hmm. what's the guarantee that an HTTP based executor receives an ACK within
a second? what if the agent is down?
> 
> Qian Zhang wrote:
>     If executor does not receives an ACK from agent within 1 second, that means there
should be something wrong in agent, so with this code, the executor will exit with -1 and
a message so that we can catch such situation. Maybe we should enlarge this timeout a bit
(e.g., 3 seconds) to be safer?
> 
> Vinod Kone wrote:
>     Executor exiting with -1 code when an agent is down or restarting (probably for an
upgrade) seems unfortunate. Since we allow agents to be down for upto "MESOS_RECOVERY_TIMEOUT"
(default 15 mins) if "MESOS_CHECKPOINT" is set, maybe the command executor could wait for
"MESOS_RECOVERY_TIMEOUT" if it is disconnected? If it is connected then yes, it probably should
wait for less time (1s is too short?) and then exit because that's seems like there is something
wrong with the agent. Does that make sense?
> 
> Qian Zhang wrote:
>     Yes, it makes sense. So what we need to do is:
>     1. For adapter based executor, keep what I am doing now, i.e., in `reaped()` do the
self termination with 0 as exit code after 1s.
>     2. For HTTP based executor, in `reaped()`, check if `MESOS_CHECKPOINT` is set:
>        2.1 If it is not set, do the self termination with 1 as exit code after 3s (1s
is too short).
>        2.2 If it is set, check if the executor is disconnected from agent, if yes, do
the self termination with 1 as exit code after `MESOS_RECOVERY_TIMEOUT`, if no, do the self
termination with 1 as exit code after 3s.
>     
>     Please let me know if you have further comments, I will update the patch soon.
> 
> Huadong Liu wrote:
>     Hi Qian, Vinod, I have tried 3s in Yelp's mesos/chronos environments. It helped at
some extent, but seemed not enough for all the cases. Since the executor will terminate once
an ACK is received, why not delay for an extended period of time, say 10s, or even 30s for
rare cases (e.g. agent being extermely busy)?
> 
> Vinod Kone wrote:
>     For 2.2) what if agent goes down right after executor checks for connectedness in
`reaped()`? seems like it would only wait 3s which seems non-ideal. Maybe this logic is becoming
too complex? To simplify, how about having the executor always wait for a resonable amount
of time (60s) in `reaped()`? HTTP based executor would exit early if they get an ACK for terminal
update. How does that sound? We would need to make sure that our unit tests (that use command
executor) are not slowed down because of this.
> 
> Qian Zhang wrote:
>     > what if agent goes down right after executor checks for connectedness in reaped()?
>     
>     Yes, I thought about this case as well, to handle it, it will make the logic here
even more complex though the logic of what I posted above is already complex and not that
readable. So I agree we should simplify it, what about the following?
>     1. For adapter based executor, in `reaped()` do the self termination with 0 as exit
code after 1s.
>     2. For HTTP based executor, in `reaped()`, do the self termination with 1 as exit
code after 60s (@huadongliu, this should satisfy your requirement :-).
>     Or you think we should always wait for 60s for both adapter based executor and HTTP
based executor?

your plan sounds good to me.


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/46187/#review147808
-----------------------------------------------------------


On Sept. 7, 2016, 2:03 p.m., Qian Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/46187/
> -----------------------------------------------------------
> 
> (Updated Sept. 7, 2016, 2:03 p.m.)
> 
> 
> Review request for mesos, Anand Mazumdar and Vinod Kone.
> 
> 
> Bugs: MESOS-5276
>     https://issues.apache.org/jira/browse/MESOS-5276
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Terminate when receiving the ACK of terminal status update.
> 
> 
> Diffs
> -----
> 
>   src/launcher/executor.cpp 5370634ef9e6f3ac9717fed71f6a77707026a16a 
> 
> Diff: https://reviews.apache.org/r/46187/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message