mesos-dev mailing list archives

From "Brenden Matthews" <bren...@diddyinc.com>
Subject Re: Review Request: Terminate correct tasks when a slave disconnects.
Date Mon, 06 May 2013 18:04:56 GMT


> On May 6, 2013, 5:20 p.m., Vinod Kone wrote:
> > src/master/master.cpp, lines 1784-1786
> > <https://reviews.apache.org/r/10951/diff/1/?file=288131#file288131line1784>
> >
> >     Wow. This is really a bug. Thanks for catching this!
> >     
> >     I think a better way to do this is to change the foreach loop (#1776) to loop
> >     through the slave's tasks instead of the framework's tasks (which can be huge!).
> >     Inside the for loop we can check whether the task belongs to the framework being
> >     removed. Makes sense?
> >     
> >     Also, we always use braces around if/for statements.
> >
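
A minimal sketch of the loop shape suggested in the review comment above: iterate over the tasks on the disconnected slave and keep only those owned by the framework being removed. The `Task` and `Slave` structs and the `tasksToRemove` helper are hypothetical, simplified stand-ins written for illustration; they are not the actual Mesos master types or the code in the patch.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for the master's bookkeeping.
// The real Mesos types are different; these exist only for illustration.
struct Task {
  std::string id;
  std::string frameworkId;
  std::string slaveId;
};

struct Slave {
  std::string id;
  std::map<std::string, Task> tasks;  // Tasks on this slave, keyed by task ID.
};

// Returns the IDs of the tasks on `slave` that belong to `frameworkId`.
// Only the slave's own tasks are visited, so tasks that the same framework
// runs on other slaves are never touched.
std::vector<std::string> tasksToRemove(
    const Slave& slave,
    const std::string& frameworkId)
{
  std::vector<std::string> result;
  for (const auto& entry : slave.tasks) {
    const Task& task = entry.second;

    // Sanity check on the assumed invariant: every task stored under a
    // slave was launched on that slave.
    assert(task.slaveId == slave.id);

    if (task.frameworkId == frameworkId) {
      result.push_back(task.id);
    }
  }
  return result;
}
```

Looping over the slave's tasks keeps the work proportional to what runs on one slave rather than to everything the framework has launched, which is the cost concern raised in the comment.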

The other ones are blockers too.  This one is actually less of a blocker than some of the
others, since the MapReduce jobs will still finish.  The one where Mesos kills task trackers
before jobs finish is a bigger problem (fixed with https://reviews.apache.org/r/10920/).


- Brenden


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10951/#review20194
-----------------------------------------------------------


On May 6, 2013, 6:04 p.m., Brenden Matthews wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/10951/
> -----------------------------------------------------------
> 
> (Updated May 6, 2013, 6:04 p.m.)
> 
> 
> Review request for mesos.
> 
> 
> Description
> -------
> 
> From d01482457f02acc1e19195995db7a14dfc2a89b9 Mon Sep 17 00:00:00 2001
> From: Brenden Matthews <brenden.matthews@airbnb.com>
> Date: Mon, 6 May 2013 09:54:03 -0700
> Subject: [PATCH] Terminate correct tasks when a slave disconnects.
> 
> Previously, when a slave disconnected all tasks for that framework would
> be removed and it would result in a bad state for a given framework.  In
> the case of Hadoop, it would result in a bunch of zombie tasks running
> on the slaves which never terminate.
> 
> Added some `operator !=' type utilities.
> ---
>  src/common/type_utils.hpp |   66 +++++++++++++++++++++++++++++++++++++++++++++
>  src/master/master.cpp     |    8 ++++--
>  2 files changed, 72 insertions(+), 2 deletions(-)
> 
> 
> Below is a sample of what the Mesos master log looks like:
> 
> 
> I0506 03:01:21.188874  2639 master.cpp:445] Slave 201305040040-3141079306-5050-1068-21(i-ced4aba2) disconnected
> I0506 03:01:21.189184  2639 master.cpp:464] Removing non-checkpointing framework 201305040040-4196536586-5050-1124-0000 from disconnected slave 201305040040-3141079306-5050-1068-21(i-ced4aba2)
> I0506 03:01:21.190471  2639 master.hpp:295] Removing task Task_Tracker_46 with resources cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] on slave 201305040040-4196536586-5050-1124-3
> I0506 03:01:21.190891  2632 hierarchical_allocator_process.hpp:544] Recovered cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] (total allocatable: cpus=15; mem=19180.2; ports=[31000-32000]; disk=763224) on slave 201305040040-4196536586-5050-1124-3 from framework 201305040040-4196536586-5050-1124-0000
> I0506 03:01:21.191614  2639 master.hpp:295] Removing task Task_Tracker_154 with resources cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] on slave 201305040040-3141079306-5050-1068-38
> I0506 03:01:21.192049  2634 hierarchical_allocator_process.hpp:544] Recovered cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] (total allocatable: cpus=15; mem=19180.2; ports=[31000-32000]; disk=761189) on slave 201305040040-3141079306-5050-1068-38 from framework 201305040040-4196536586-5050-1124-0000
> I0506 03:01:21.192828  2639 master.hpp:295] Removing task Task_Tracker_195 with resources cpus=6.5; mem=13312; disk=53248; ports=[31999-31999, 31001-31001] on slave 201305040040-3141079306-5050-1068-85
> I0506 03:01:21.193270  2640 hierarchical_allocator_process.hpp:544] Recovered cpus=6.5; mem=13312; disk=53248; ports=[31999-31999, 31001-31001] (total allocatable: cpus=10; mem=13408.8; ports=[31001-31999]; disk=596893) on slave 201305040040-3141079306-5050-1068-85 from framework 201305040040-4196536586-5050-1124-0000
> I0506 03:01:21.194039  2639 master.hpp:295] Removing task Task_Tracker_182 with resources cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] on slave 201305040040-3141079306-5050-1068-45
> I0506 03:01:21.194425  2638 hierarchical_allocator_process.hpp:544] Recovered cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] (total allocatable: cpus=15; mem=19180.2; ports=[31000-32000]; disk=760196) on slave 201305040040-3141079306-5050-1068-45 from framework 201305040040-4196536586-5050-1124-0000
> I0506 03:01:21.195190  2639 master.hpp:295] Removing task Task_Tracker_58 with resources cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] on slave 201305040040-3141079306-5050-1068-76
> I0506 03:01:21.195636  2636 hierarchical_allocator_process.hpp:544] Recovered cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] (total allocatable: cpus=15; mem=19180.2; ports=[31000-32000]; disk=761175) on slave 201305040040-3141079306-5050-1068-76 from framework 201305040040-4196536586-5050-1124-0000
> I0506 03:01:21.196455  2639 master.hpp:295] Removing task Task_Tracker_160 with resources cpus=20; mem=40960; disk=163840; ports=[31000-31000, 32000-32000] on slave 201305040040-3141079306-5050-1068-85
> I0506 03:01:21.196883  2631 hierarchical_allocator_process.hpp:544] Recovered cpus=20; mem=40960; disk=163840; ports=[31000-31000, 32000-32000] (total allocatable: cpus=30; mem=54368.8; ports=[31000-32000]; disk=760733) on slave 201305040040-3141079306-5050-1068-85 from framework 201305040040-4196536586-5050-1124-0000
> I0506 03:01:21.197710  2639 master.hpp:295] Removing task Task_Tracker_96 with resources cpus=3.5; mem=7168; disk=28672; ports=[31000-31000, 32000-32000] on slave 201305040040-3141079306-5050-1068-80
> <...log continues...>
> 
> 
> Diffs
> -----
> 
>   src/common/type_utils.hpp 377b65f 
>   src/master/master.cpp 3207157 
> 
> Diff: https://reviews.apache.org/r/10951/diff/
> 
> 
> Testing
> -------
> 
> Used in production at airbnb.
> 
> 
> Thanks,
> 
> Brenden Matthews
> 
>

