mesos-dev mailing list archives

From "Vinod Kone" <vinodk...@gmail.com>
Subject Re: Review Request: Terminate correct tasks when a slave disconnects.
Date Mon, 06 May 2013 17:20:25 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10951/#review20194
-----------------------------------------------------------



src/master/master.cpp
<https://reviews.apache.org/r/10951/#comment41437>

    Thank you.



src/master/master.cpp
<https://reviews.apache.org/r/10951/#comment41442>

    Wow. This is really a bug. Thanks for catching this!
    
    I think a better way to do this is to change the foreach loop (#1776) to loop through the slave's tasks instead of the framework's tasks (which can be huge!). Inside the loop we can check whether each task belongs to the framework being removed. Makes sense?
    
    Also, we always use braces around if/for statements.
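
    A rough sketch of the loop suggested above, under stated assumptions: the Task and Slave structs and the tasksToRemove helper are hypothetical, simplified stand-ins rather than the actual types or code in src/master/master.cpp, and a plain range-based for is used in place of the foreach macro.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for the master's bookkeeping.
// The real types in src/master carry far more state.
struct Task {
  std::string id;
  std::string frameworkId;
  std::string slaveId;
};

struct Slave {
  std::string id;
  std::map<std::string, Task> tasks;  // Only the tasks running on this slave.
};

// Returns the ids of the tasks that should be removed when the framework
// `frameworkId` is removed from the disconnected `slave`. Iterating over the
// slave's tasks keeps the work proportional to one slave, and tasks the
// framework runs on other slaves are never considered.
std::vector<std::string> tasksToRemove(
    const Slave& slave,
    const std::string& frameworkId) {
  std::vector<std::string> result;
  for (const auto& entry : slave.tasks) {
    const Task& task = entry.second;
    if (task.frameworkId == frameworkId) {
      result.push_back(task.id);
    }
  }
  return result;
}
```

    With this shape, a slave holding one task from each of two frameworks yields only the task of the framework being removed, and the other framework's task is left alone.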
    


- Vinod Kone


On May 6, 2013, 4:59 p.m., Brenden Matthews wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/10951/
> -----------------------------------------------------------
> 
> (Updated May 6, 2013, 4:59 p.m.)
> 
> 
> Review request for mesos.
> 
> 
> Description
> -------
> 
> From d5576303ecaaf3c02eba082c8d5b6cf483e36dae Mon Sep 17 00:00:00 2001
> From: Brenden Matthews <brenden.matthews@airbnb.com>
> Date: Mon, 6 May 2013 09:54:03 -0700
> Subject: [PATCH] Terminate correct tasks when a slave disconnects.
> 
> Previously, when a slave disconnected, all tasks for that framework were
> removed, which left the framework in a bad state.  In
> the case of Hadoop, this resulted in a bunch of zombie tasks running
> on the slaves which never terminate.
> ---
>  src/master/master.cpp |    6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> 
> Below is a sample of what the Mesos master log looks like:
> 
> 
> I0506 03:01:21.188874  2639 master.cpp:445] Slave 201305040040-3141079306-5050-1068-21(i-ced4aba2) disconnected
> I0506 03:01:21.189184  2639 master.cpp:464] Removing non-checkpointing framework 201305040040-4196536586-5050-1124-0000 from disconnected slave 201305040040-3141079306-5050-1068-21(i-ced4aba2)
> I0506 03:01:21.190471  2639 master.hpp:295] Removing task Task_Tracker_46 with resources cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] on slave 201305040040-4196536586-5050-1124-3
> I0506 03:01:21.190891  2632 hierarchical_allocator_process.hpp:544] Recovered cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] (total allocatable: cpus=15; mem=19180.2; ports=[31000-32000]; disk=763224) on slave 201305040040-4196536586-5050-1124-3 from framework 201305040040-4196536586-5050-1124-0000
> I0506 03:01:21.191614  2639 master.hpp:295] Removing task Task_Tracker_154 with resources cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] on slave 201305040040-3141079306-5050-1068-38
> I0506 03:01:21.192049  2634 hierarchical_allocator_process.hpp:544] Recovered cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] (total allocatable: cpus=15; mem=19180.2; ports=[31000-32000]; disk=761189) on slave 201305040040-3141079306-5050-1068-38 from framework 201305040040-4196536586-5050-1124-0000
> I0506 03:01:21.192828  2639 master.hpp:295] Removing task Task_Tracker_195 with resources cpus=6.5; mem=13312; disk=53248; ports=[31999-31999, 31001-31001] on slave 201305040040-3141079306-5050-1068-85
> I0506 03:01:21.193270  2640 hierarchical_allocator_process.hpp:544] Recovered cpus=6.5; mem=13312; disk=53248; ports=[31999-31999, 31001-31001] (total allocatable: cpus=10; mem=13408.8; ports=[31001-31999]; disk=596893) on slave 201305040040-3141079306-5050-1068-85 from framework 201305040040-4196536586-5050-1124-0000
> I0506 03:01:21.194039  2639 master.hpp:295] Removing task Task_Tracker_182 with resources cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] on slave 201305040040-3141079306-5050-1068-45
> I0506 03:01:21.194425  2638 hierarchical_allocator_process.hpp:544] Recovered cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] (total allocatable: cpus=15; mem=19180.2; ports=[31000-32000]; disk=760196) on slave 201305040040-3141079306-5050-1068-45 from framework 201305040040-4196536586-5050-1124-0000
> I0506 03:01:21.195190  2639 master.hpp:295] Removing task Task_Tracker_58 with resources cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] on slave 201305040040-3141079306-5050-1068-76
> I0506 03:01:21.195636  2636 hierarchical_allocator_process.hpp:544] Recovered cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] (total allocatable: cpus=15; mem=19180.2; ports=[31000-32000]; disk=761175) on slave 201305040040-3141079306-5050-1068-76 from framework 201305040040-4196536586-5050-1124-0000
> I0506 03:01:21.196455  2639 master.hpp:295] Removing task Task_Tracker_160 with resources cpus=20; mem=40960; disk=163840; ports=[31000-31000, 32000-32000] on slave 201305040040-3141079306-5050-1068-85
> I0506 03:01:21.196883  2631 hierarchical_allocator_process.hpp:544] Recovered cpus=20; mem=40960; disk=163840; ports=[31000-31000, 32000-32000] (total allocatable: cpus=30; mem=54368.8; ports=[31000-32000]; disk=760733) on slave 201305040040-3141079306-5050-1068-85 from framework 201305040040-4196536586-5050-1124-0000
> I0506 03:01:21.197710  2639 master.hpp:295] Removing task Task_Tracker_96 with resources cpus=3.5; mem=7168; disk=28672; ports=[31000-31000, 32000-32000] on slave 201305040040-3141079306-5050-1068-80
> <...log continues...>
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp 3207157 
> 
> Diff: https://reviews.apache.org/r/10951/diff/
> 
> 
> Testing
> -------
> 
> Used in production at airbnb.
> 
> 
> Thanks,
> 
> Brenden Matthews
> 
>

