mesos-dev mailing list archives

From "Brenden Matthews" <bren...@diddyinc.com>
Subject Re: Review Request: Terminate correct tasks when a slave disconnects.
Date Mon, 06 May 2013 17:29:41 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10951/
-----------------------------------------------------------

(Updated May 6, 2013, 5:29 p.m.)


Review request for mesos.


Description
-------

From d5576303ecaaf3c02eba082c8d5b6cf483e36dae Mon Sep 17 00:00:00 2001
From: Brenden Matthews <brenden.matthews@airbnb.com>
Date: Mon, 6 May 2013 09:54:03 -0700
Subject: [PATCH] Terminate correct tasks when a slave disconnects.

Previously, when a slave disconnected, all of the framework's tasks were
removed, including tasks running on other slaves, which left the
framework in a bad state.  In the case of Hadoop, this left zombie tasks
running on the slaves that never terminated.
---
 src/master/master.cpp |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)
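
To illustrate the intended behavior, here is a minimal sketch of removing
only the tasks that ran on the disconnected slave. This is not the actual
patch (see the diff link for that); Task, Framework, and
removeTasksOnSlave are hypothetical, simplified stand-ins and not the real
Mesos master types.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for the master's bookkeeping.
// These are NOT the real Mesos types.
struct Task {
  std::string id;
  std::string slaveId;  // slave the task was launched on
};

struct Framework {
  std::map<std::string, Task> tasks;  // task id -> task
};

// Removes only those tasks of `framework` that ran on `slaveId` and
// returns their ids. Tasks on other slaves are left untouched.
std::vector<std::string> removeTasksOnSlave(Framework& framework,
                                            const std::string& slaveId) {
  std::vector<std::string> removed;
  for (auto it = framework.tasks.begin(); it != framework.tasks.end();) {
    if (it->second.slaveId == slaveId) {
      removed.push_back(it->first);
      it = framework.tasks.erase(it);  // erase returns the next iterator
    } else {
      ++it;
    }
  }
  return removed;
}
```

The point of the sketch is the per-task slave check: without it, every
task of the framework is dropped regardless of where it runs.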


Below is a sample of the Mesos master log when a slave disconnects:


I0506 03:01:21.188874  2639 master.cpp:445] Slave 201305040040-3141079306-5050-1068-21(i-ced4aba2) disconnected
I0506 03:01:21.189184  2639 master.cpp:464] Removing non-checkpointing framework 201305040040-4196536586-5050-1124-0000 from disconnected slave 201305040040-3141079306-5050-1068-21(i-ced4aba2)
I0506 03:01:21.190471  2639 master.hpp:295] Removing task Task_Tracker_46 with resources cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] on slave 201305040040-4196536586-5050-1124-3
I0506 03:01:21.190891  2632 hierarchical_allocator_process.hpp:544] Recovered cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] (total allocatable: cpus=15; mem=19180.2; ports=[31000-32000]; disk=763224) on slave 201305040040-4196536586-5050-1124-3 from framework 201305040040-4196536586-5050-1124-0000
I0506 03:01:21.191614  2639 master.hpp:295] Removing task Task_Tracker_154 with resources cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] on slave 201305040040-3141079306-5050-1068-38
I0506 03:01:21.192049  2634 hierarchical_allocator_process.hpp:544] Recovered cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] (total allocatable: cpus=15; mem=19180.2; ports=[31000-32000]; disk=761189) on slave 201305040040-3141079306-5050-1068-38 from framework 201305040040-4196536586-5050-1124-0000
I0506 03:01:21.192828  2639 master.hpp:295] Removing task Task_Tracker_195 with resources cpus=6.5; mem=13312; disk=53248; ports=[31999-31999, 31001-31001] on slave 201305040040-3141079306-5050-1068-85
I0506 03:01:21.193270  2640 hierarchical_allocator_process.hpp:544] Recovered cpus=6.5; mem=13312; disk=53248; ports=[31999-31999, 31001-31001] (total allocatable: cpus=10; mem=13408.8; ports=[31001-31999]; disk=596893) on slave 201305040040-3141079306-5050-1068-85 from framework 201305040040-4196536586-5050-1124-0000
I0506 03:01:21.194039  2639 master.hpp:295] Removing task Task_Tracker_182 with resources cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] on slave 201305040040-3141079306-5050-1068-45
I0506 03:01:21.194425  2638 hierarchical_allocator_process.hpp:544] Recovered cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] (total allocatable: cpus=15; mem=19180.2; ports=[31000-32000]; disk=760196) on slave 201305040040-3141079306-5050-1068-45 from framework 201305040040-4196536586-5050-1124-0000
I0506 03:01:21.195190  2639 master.hpp:295] Removing task Task_Tracker_58 with resources cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] on slave 201305040040-3141079306-5050-1068-76
I0506 03:01:21.195636  2636 hierarchical_allocator_process.hpp:544] Recovered cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] (total allocatable: cpus=15; mem=19180.2; ports=[31000-32000]; disk=761175) on slave 201305040040-3141079306-5050-1068-76 from framework 201305040040-4196536586-5050-1124-0000
I0506 03:01:21.196455  2639 master.hpp:295] Removing task Task_Tracker_160 with resources cpus=20; mem=40960; disk=163840; ports=[31000-31000, 32000-32000] on slave 201305040040-3141079306-5050-1068-85
I0506 03:01:21.196883  2631 hierarchical_allocator_process.hpp:544] Recovered cpus=20; mem=40960; disk=163840; ports=[31000-31000, 32000-32000] (total allocatable: cpus=30; mem=54368.8; ports=[31000-32000]; disk=760733) on slave 201305040040-3141079306-5050-1068-85 from framework 201305040040-4196536586-5050-1124-0000
I0506 03:01:21.197710  2639 master.hpp:295] Removing task Task_Tracker_96 with resources cpus=3.5; mem=7168; disk=28672; ports=[31000-31000, 32000-32000] on slave 201305040040-3141079306-5050-1068-80
<...log continues...>


Diffs (updated)
-----

  src/master/master.hpp d3790dc 
  src/master/master.cpp 3207157 

Diff: https://reviews.apache.org/r/10951/diff/


Testing
-------

Used in production at Airbnb.


Thanks,

Brenden Matthews

