Subject: Re: Review Request: Terminate correct tasks when a slave disconnects.
From: "Vinod Kone"
To: "Vinod Kone", "mesos", "Brenden Matthews"
Date: Mon, 06 May 2013 18:35:01 -0000

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10951/#review20210
-----------------------------------------------------------


src/common/type_utils.hpp

    we typically don't overload "!=" operators for protobufs, but rather use "!(protobuf1 == protobuf2)".

    i know that's annoying, but we would like to keep type_utils as short as possible.

    also, we only specifically overload the "==" operator for a protobuf when the default is not good enough.


src/master/master.cpp

    you could do:

    if (!(task->framework_id == framework->id)) { .. }


- Vinod Kone


On May 6, 2013, 6:04 p.m., Brenden Matthews wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/10951/
> -----------------------------------------------------------
>
> (Updated May 6, 2013, 6:04 p.m.)
>
>
> Review request for mesos.
>
>
> Description
> -------
>
> From d01482457f02acc1e19195995db7a14dfc2a89b9 Mon Sep 17 00:00:00 2001
> From: Brenden Matthews
> Date: Mon, 6 May 2013 09:54:03 -0700
> Subject: [PATCH] Terminate correct tasks when a slave disconnects.
>
> Previously, when a slave disconnected all tasks for that framework would
> be removed and it would result in a bad state for a given framework. In
> the case of Hadoop, it would result in a bunch of zombie tasks running
> on the slaves which never terminate.
>
> Added some `operator !=' type utilities.
> ---
>  src/common/type_utils.hpp |   66 +++++++++++++++++++++++++++++++++++++++++++++
>  src/master/master.cpp     |    8 ++++--
>  2 files changed, 72 insertions(+), 2 deletions(-)
>
>
> Below is a sample of what the Mesos master log looks like:
>
>
> I0506 03:01:21.188874 2639 master.cpp:445] Slave 201305040040-3141079306-5050-1068-21(i-ced4aba2) disconnected
> I0506 03:01:21.189184 2639 master.cpp:464] Removing non-checkpointing framework 201305040040-4196536586-5050-1124-0000 from disconnected slave 201305040040-3141079306-5050-1068-21(i-ced4aba2)
> I0506 03:01:21.190471 2639 master.hpp:295] Removing task Task_Tracker_46 with resources cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] on slave 201305040040-4196536586-5050-1124-3
> I0506 03:01:21.190891 2632 hierarchical_allocator_process.hpp:544] Recovered cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] (total allocatable: cpus=15; mem=19180.2; ports=[31000-32000]; disk=763224) on slave 201305040040-4196536586-5050-1124-3 from framework 201305040040-4196536586-5050-1124-0000
> I0506 03:01:21.191614 2639 master.hpp:295] Removing task Task_Tracker_154 with resources cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] on slave 201305040040-3141079306-5050-1068-38
> I0506 03:01:21.192049 2634 hierarchical_allocator_process.hpp:544] Recovered cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] (total allocatable: cpus=15; mem=19180.2; ports=[31000-32000]; disk=761189) on slave 201305040040-3141079306-5050-1068-38 from framework 201305040040-4196536586-5050-1124-0000
> I0506 03:01:21.192828 2639 master.hpp:295] Removing task Task_Tracker_195 with resources cpus=6.5; mem=13312; disk=53248; ports=[31999-31999, 31001-31001] on slave 201305040040-3141079306-5050-1068-85
> I0506 03:01:21.193270 2640 hierarchical_allocator_process.hpp:544] Recovered cpus=6.5; mem=13312; disk=53248; ports=[31999-31999, 31001-31001] (total allocatable: cpus=10; mem=13408.8; ports=[31001-31999]; disk=596893) on slave 201305040040-3141079306-5050-1068-85 from framework 201305040040-4196536586-5050-1124-0000
> I0506 03:01:21.194039 2639 master.hpp:295] Removing task Task_Tracker_182 with resources cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] on slave 201305040040-3141079306-5050-1068-45
> I0506 03:01:21.194425 2638 hierarchical_allocator_process.hpp:544] Recovered cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] (total allocatable: cpus=15; mem=19180.2; ports=[31000-32000]; disk=760196) on slave 201305040040-3141079306-5050-1068-45 from framework 201305040040-4196536586-5050-1124-0000
> I0506 03:01:21.195190 2639 master.hpp:295] Removing task Task_Tracker_58 with resources cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] on slave 201305040040-3141079306-5050-1068-76
> I0506 03:01:21.195636 2636 hierarchical_allocator_process.hpp:544] Recovered cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] (total allocatable: cpus=15; mem=19180.2; ports=[31000-32000]; disk=761175) on slave 201305040040-3141079306-5050-1068-76 from framework 201305040040-4196536586-5050-1124-0000
> I0506 03:01:21.196455 2639 master.hpp:295] Removing task Task_Tracker_160 with resources cpus=20; mem=40960; disk=163840; ports=[31000-31000, 32000-32000] on slave 201305040040-3141079306-5050-1068-85
> I0506 03:01:21.196883 2631 hierarchical_allocator_process.hpp:544] Recovered cpus=20; mem=40960; disk=163840; ports=[31000-31000, 32000-32000] (total allocatable: cpus=30; mem=54368.8; ports=[31000-32000]; disk=760733) on slave 201305040040-3141079306-5050-1068-85 from framework 201305040040-4196536586-5050-1124-0000
> I0506 03:01:21.197710 2639 master.hpp:295] Removing task Task_Tracker_96 with resources cpus=3.5; mem=7168; disk=28672; ports=[31000-31000, 32000-32000] on slave 201305040040-3141079306-5050-1068-80
> <...log continues...>
>
>
> Diffs
> -----
>
>   src/common/type_utils.hpp 377b65f
>   src/master/master.cpp 3207157
>
> Diff: https://reviews.apache.org/r/10951/diff/
>
>
> Testing
> -------
>
> Used in production at airbnb.
>
>
> Thanks,
>
> Brenden Matthews
>
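
For illustration only, here is a minimal sketch of the convention from the review comment on src/common/type_utils.hpp: define only "==" for an ID type and write inequality as "!(a == b)" at the call site. It also shows the behaviour the patch description asks for, selecting only the tasks that ran on the disconnected slave. SlaveID, Task and tasksOnSlave are hypothetical stand-ins written as plain structs. They are not the Mesos protobuf types and not the code under review, and the sketch assumes a C++11 compiler.

```cpp
#include <cassert>  // For assert() in callers that check the result.
#include <string>
#include <vector>

// Hypothetical stand-in for a protobuf ID message (not the real Mesos type).
struct SlaveID {
  std::string value;
};

// Only operator== is provided. No operator!= is defined, so call sites
// spell inequality as !(a == b), as the reviewer suggests.
inline bool operator==(const SlaveID& left, const SlaveID& right) {
  return left.value == right.value;
}

// Hypothetical, simplified task record.
struct Task {
  std::string name;
  SlaveID slave_id;
};

// Returns copies of the tasks that were running on the disconnected slave.
// Tasks on any other slave are skipped, so they would not be removed.
inline std::vector<Task> tasksOnSlave(const std::vector<Task>& tasks,
                                      const SlaveID& disconnected) {
  std::vector<Task> result;
  for (const Task& task : tasks) {
    if (!(task.slave_id == disconnected)) {
      continue;  // Task belongs to a different slave: leave it alone.
    }
    result.push_back(task);
  }
  return result;
}
```

Given tasks on "slave-1" and "slave-2", calling tasksOnSlave(tasks, SlaveID{"slave-1"}) returns only the "slave-1" tasks. Keeping a single comparison operator per type means one fewer definition to maintain in the header, at the cost of a slightly noisier call site.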