mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ben Mahler" <benjamin.mah...@gmail.com>
Subject Re: Review Request 39873: Fixed master to properly handle status updates when multiple of them are enqueued on the slave.
Date Thu, 05 Nov 2015 01:00:42 GMT


> On Nov. 4, 2015, 2:43 a.m., Ben Mahler wrote:
> > src/master/master.cpp, line 6039
> > <https://reviews.apache.org/r/39873/diff/2/?file=1114216#file1114216line6039>
> >
> >     Why not have the same logic here to be defensive? Or do you intend to guard
against it? Just seems weird to allow it in one of the cases but not the other.
> 
> Vinod Kone wrote:
>     i didn't do it in the master's case because it would be a bug in the code if the
master is trying to call updateTask() for an already terminated task! we have to do it for
updates from slave because we don't control the executor's updates. i could make it a CHECK
for the master case though.

Yeah I know it's a bug in the if it happens :)

Also, if you leave as it without a guard or the same logic, what happens when we introduce
the bug in the master that transitions to another terminal state? I'm fine with either the
guard (adding a CHECK) or using the same logic.


> On Nov. 4, 2015, 2:43 a.m., Ben Mahler wrote:
> > src/master/master.cpp, lines 6004-6013
> > <https://reviews.apache.org/r/39873/diff/2/?file=1114216#file1114216line6004>
> >
> >     Just looking at the patch that introduced this bug, why are we removing the
out-of-order update prevention? Don't see any mention of this in the stuff around https://issues.apache.org/jira/browse/MESOS-2864.
> 
> Vinod Kone wrote:
>     There are couple reasons for this
>     
>     1) it's hard to figure out if an update is an out of order update. for example, if
TASK_STARTING, TASK_RUNNING and TASK_FINISHED are all enqued on the slave, the master might
receive non-terminal (TASK_RUNNING) and terminal (TASK_FINISHED) after it transitioned the
task to terminal state (TASK_STARTING update w/ latest state as TASK_FINISHED).
>     
>     2) when updateTask() is called, typically an update is being forwarded to the scheduler.
so while it might be ok to not transition the (latest) state of the task, it is important
to set the task.status_update_state and task.status_update_uuid correctly. so we shouldn't
short-circuit the function without setting these.
>     
>     makes sense?

Isn't the reason that the check was just wrong? Or did we break it when we started storing
the latest state? It looks wrong to me here, since the task->state is the latest state:
https://reviews.apache.org/r/38051/diff/6#0


- Ben


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/39873/#review105027
-----------------------------------------------------------


On Nov. 2, 2015, 10:34 p.m., Vinod Kone wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/39873/
> -----------------------------------------------------------
> 
> (Updated Nov. 2, 2015, 10:34 p.m.)
> 
> 
> Review request for mesos and Ben Mahler.
> 
> 
> Bugs: MESOS-2864
>     https://issues.apache.org/jira/browse/MESOS-2864
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> The master now doesn't change the latest state of a task if it has already terminated.
But it still updates the status update state and uuid.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp 2bc5a97a5b50c8a8a9902c47b2e9e3b5216d97ea 
> 
> Diff: https://reviews.apache.org/r/39873/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> Ran the "RecoverCompletedExecutor" test 100 times as it was failing most of the time
without this fix.
> 
> 
> Thanks,
> 
> Vinod Kone
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message