mesos-dev mailing list archives

From: "Brenden Matthews" <bren...@diddyinc.com>
Subject: Re: Review Request: Terminate correct tasks when a slave disconnects.
Date: Mon, 06 May 2013 18:05:02 GMT


> On May 6, 2013, 5:39 p.m., Vinod Kone wrote:
> > src/master/master.cpp, line 1776
> > <https://reviews.apache.org/r/10951/diff/2/?file=288156#file288156line1776>
> >
> >     we don't want to remove all the tasks on this slave! only those that belong to this framework. so,
> >     
> >     foreachvalue (Task* task, utils::copy(slave->tasks)) {
> >       // Remove the task if it belongs to the framework
> >       // being removed. 
> >       if (task->framework_id() == framework->id) {
> >        ...
> >        ...
> >       }
> >     }
> >     
> >

Whoops.  Shouldn't have rushed through that.
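
For reference, the filtering suggested in the review comment above could look roughly like the sketch below. This is only an illustration under stated assumptions: `Task`, `Slave`, and `removeFrameworkTasks` are hypothetical stand-ins written for this example, not the actual Mesos types or the code in src/master/master.cpp. It iterates over a copy of the slave's task map (playing the role of `utils::copy(slave->tasks)`) so that erasing entries during the loop is safe.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins for the Mesos types; NOT the real definitions.
struct Task {
  std::string id;
  std::string framework_id;
};

struct Slave {
  std::map<std::string, Task> tasks;  // Keyed by task id.
};

// Removes only the tasks on `slave` that belong to `framework_id`,
// leaving tasks of other frameworks untouched. Returns the ids of
// the removed tasks.
std::vector<std::string> removeFrameworkTasks(
    Slave& slave, const std::string& framework_id) {
  std::vector<std::string> removed;

  // Iterate over a snapshot so that erasing from slave.tasks does not
  // invalidate the iteration.
  const std::map<std::string, Task> snapshot = slave.tasks;

  for (const auto& entry : snapshot) {
    // Skip tasks that belong to some other framework.
    if (entry.second.framework_id != framework_id) {
      continue;
    }
    removed.push_back(entry.first);
    slave.tasks.erase(entry.first);
  }

  return removed;
}
```

Calling `removeFrameworkTasks(slave, "fw-A")` on a slave holding tasks for frameworks "fw-A" and "fw-B" would remove only the "fw-A" tasks and leave the rest in place.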


- Brenden


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10951/#review20197
-----------------------------------------------------------


On May 6, 2013, 6:04 p.m., Brenden Matthews wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/10951/
> -----------------------------------------------------------
> 
> (Updated May 6, 2013, 6:04 p.m.)
> 
> 
> Review request for mesos.
> 
> 
> Description
> -------
> 
> From d01482457f02acc1e19195995db7a14dfc2a89b9 Mon Sep 17 00:00:00 2001
> From: Brenden Matthews <brenden.matthews@airbnb.com>
> Date: Mon, 6 May 2013 09:54:03 -0700
> Subject: [PATCH] Terminate correct tasks when a slave disconnects.
> 
> Previously, when a slave disconnected, all of the framework's tasks were
> removed, including tasks on other slaves, which left the framework in a
> bad state.  In the case of Hadoop, this left a bunch of zombie tasks
> running on the slaves that never terminated.
> 
> Added some `operator !=' type utilities.
> ---
>  src/common/type_utils.hpp |   66 +++++++++++++++++++++++++++++++++++++++++++++
>  src/master/master.cpp     |    8 ++++--
>  2 files changed, 72 insertions(+), 2 deletions(-)
> 
> 
> Below is a sample of what the Mesos master log looks like:
> 
> 
> I0506 03:01:21.188874  2639 master.cpp:445] Slave 201305040040-3141079306-5050-1068-21(i-ced4aba2) disconnected
> I0506 03:01:21.189184  2639 master.cpp:464] Removing non-checkpointing framework 201305040040-4196536586-5050-1124-0000 from disconnected slave 201305040040-3141079306-5050-1068-21(i-ced4aba2)
> I0506 03:01:21.190471  2639 master.hpp:295] Removing task Task_Tracker_46 with resources cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] on slave 201305040040-4196536586-5050-1124-3
> I0506 03:01:21.190891  2632 hierarchical_allocator_process.hpp:544] Recovered cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] (total allocatable: cpus=15; mem=19180.2; ports=[31000-32000]; disk=763224) on slave 201305040040-4196536586-5050-1124-3 from framework 201305040040-4196536586-5050-1124-0000
> I0506 03:01:21.191614  2639 master.hpp:295] Removing task Task_Tracker_154 with resources cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] on slave 201305040040-3141079306-5050-1068-38
> I0506 03:01:21.192049  2634 hierarchical_allocator_process.hpp:544] Recovered cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] (total allocatable: cpus=15; mem=19180.2; ports=[31000-32000]; disk=761189) on slave 201305040040-3141079306-5050-1068-38 from framework 201305040040-4196536586-5050-1124-0000
> I0506 03:01:21.192828  2639 master.hpp:295] Removing task Task_Tracker_195 with resources cpus=6.5; mem=13312; disk=53248; ports=[31999-31999, 31001-31001] on slave 201305040040-3141079306-5050-1068-85
> I0506 03:01:21.193270  2640 hierarchical_allocator_process.hpp:544] Recovered cpus=6.5; mem=13312; disk=53248; ports=[31999-31999, 31001-31001] (total allocatable: cpus=10; mem=13408.8; ports=[31001-31999]; disk=596893) on slave 201305040040-3141079306-5050-1068-85 from framework 201305040040-4196536586-5050-1124-0000
> I0506 03:01:21.194039  2639 master.hpp:295] Removing task Task_Tracker_182 with resources cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] on slave 201305040040-3141079306-5050-1068-45
> I0506 03:01:21.194425  2638 hierarchical_allocator_process.hpp:544] Recovered cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] (total allocatable: cpus=15; mem=19180.2; ports=[31000-32000]; disk=760196) on slave 201305040040-3141079306-5050-1068-45 from framework 201305040040-4196536586-5050-1124-0000
> I0506 03:01:21.195190  2639 master.hpp:295] Removing task Task_Tracker_58 with resources cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] on slave 201305040040-3141079306-5050-1068-76
> I0506 03:01:21.195636  2636 hierarchical_allocator_process.hpp:544] Recovered cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] (total allocatable: cpus=15; mem=19180.2; ports=[31000-32000]; disk=761175) on slave 201305040040-3141079306-5050-1068-76 from framework 201305040040-4196536586-5050-1124-0000
> I0506 03:01:21.196455  2639 master.hpp:295] Removing task Task_Tracker_160 with resources cpus=20; mem=40960; disk=163840; ports=[31000-31000, 32000-32000] on slave 201305040040-3141079306-5050-1068-85
> I0506 03:01:21.196883  2631 hierarchical_allocator_process.hpp:544] Recovered cpus=20; mem=40960; disk=163840; ports=[31000-31000, 32000-32000] (total allocatable: cpus=30; mem=54368.8; ports=[31000-32000]; disk=760733) on slave 201305040040-3141079306-5050-1068-85 from framework 201305040040-4196536586-5050-1124-0000
> I0506 03:01:21.197710  2639 master.hpp:295] Removing task Task_Tracker_96 with resources cpus=3.5; mem=7168; disk=28672; ports=[31000-31000, 32000-32000] on slave 201305040040-3141079306-5050-1068-80
> <...log continues...>
> 
> 
> Diffs
> -----
> 
>   src/common/type_utils.hpp 377b65f 
>   src/master/master.cpp 3207157 
> 
> Diff: https://reviews.apache.org/r/10951/diff/
> 
> 
> Testing
> -------
> 
> Used in production at Airbnb.
> 
> 
> Thanks,
> 
> Brenden Matthews
> 
>

