Subject: Re: Review Request: Non-disruptive Slave Restart with Recovery!
From: "Vinod Kone"
To: "Benjamin Hindman", "John Sirois"
Cc: "Charles Reiss", "mesos", "Vinod Kone"
Date: Tue, 03 Apr 2012 20:58:18 -0000
Reply-To: mesos-dev@incubator.apache.org
Message-ID: <20120403205818.4079.20135@reviews.apache.org>
In-Reply-To: <20120323011812.382.77230@reviews.apache.org>
X-ReviewRequest-URL: https://reviews.apache.org/r/4462/

-----------------------------------------------------------
This is an automatically generated e-mail.
To reply, visit: https://reviews.apache.org/r/4462/
-----------------------------------------------------------

(Updated 2012-04-03 20:58:17.982746)


Review request for mesos, Benjamin Hindman and John Sirois.


Changes
-------

Merged with trunk.


Summary
-------

Sorry for the huge CL!

Slave restart now supports recovery!

--> Non-disruptive restart means running tasks are not lost
--> Re-connects with live executors
--> Checkpoints and reliably sends status updates
--> Ability to kill executors if the slave upgrade is incompatible with running executors


This addresses bug mesos-110.
    https://issues.apache.org/jira/browse/mesos-110


Diffs (updated)
-----

  src/Makefile.am d5edaa2
  src/common/hashset.hpp 1feb610
  src/common/utils.hpp 1d81e21
  src/exec/exec.cpp e8db407
  src/launcher/launcher.cpp a141b9a
  src/local/local.hpp 55f9eaf
  src/local/local.cpp affe432
  src/master/master.cpp 4dc9ee0
  src/messages/messages.proto 87e1548
  src/sched/sched.cpp dcadb10
  src/slave/constants.hpp f0c8679
  src/slave/isolation_module.hpp c896908
  src/slave/lxc_isolation_module.hpp b7beefe
  src/slave/lxc_isolation_module.cpp 66a2a89
  src/slave/main.cpp 85cba25
  src/slave/process_based_isolation_module.hpp f6f9554
  src/slave/process_based_isolation_module.cpp 2b37d42
  src/slave/slave.hpp 279bc7b
  src/slave/slave.cpp 3358ec4
  src/tests/fault_tolerance_tests.cpp 6772daf
  src/tests/slave_restart_tests.cpp PRE-CREATION
  src/tests/utils.hpp e81ec82

Diff: https://reviews.apache.org/r/4462/diff


Testing
-------

make check.

Note that only the new test in tests/slave_restart_tests.cpp engages in recovery!
Recovery is disabled for old tests (though they still checkpoint relevant info!)


Thanks,

Vinod