Return-Path: X-Original-To: apmail-mesos-dev-archive@www.apache.org Delivered-To: apmail-mesos-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 293E117B58 for ; Mon, 13 Oct 2014 10:14:18 +0000 (UTC) Received: (qmail 26798 invoked by uid 500); 13 Oct 2014 10:14:18 -0000 Delivered-To: apmail-mesos-dev-archive@mesos.apache.org Received: (qmail 26733 invoked by uid 500); 13 Oct 2014 10:14:18 -0000 Mailing-List: contact dev-help@mesos.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@mesos.apache.org Delivered-To: mailing list dev@mesos.apache.org Received: (qmail 26719 invoked by uid 99); 13 Oct 2014 10:14:17 -0000 Received: from reviews-vm.apache.org (HELO reviews.apache.org) (140.211.11.40) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 13 Oct 2014 10:14:17 +0000 Received: from reviews.apache.org (localhost [127.0.0.1]) by reviews.apache.org (Postfix) with ESMTP id 76B901D352C; Mon, 13 Oct 2014 10:14:13 +0000 (UTC) Content-Type: multipart/alternative; boundary="===============4671630756290759253==" MIME-Version: 1.0 Subject: Re: Review Request 23912: Fix MESOS-947: Slave should properly handle a killTask() that arrives between runTask() and _runTask() From: "Bernd Mathiske" To: "Vinod Kone" , "Adam B" , "Bernd Mathiske" , "mesos" Date: Mon, 13 Oct 2014 10:14:13 -0000 Message-ID: <20141013101413.24413.48008@reviews.apache.org> X-ReviewBoard-URL: https://reviews.apache.org Auto-Submitted: auto-generated Sender: "Bernd Mathiske" X-ReviewGroup: mesos X-ReviewRequest-URL: https://reviews.apache.org/r/23912/ X-Sender: "Bernd Mathiske" References: <20141010182454.24817.5839@reviews.apache.org> In-Reply-To: <20141010182454.24817.5839@reviews.apache.org> Reply-To: "Bernd Mathiske" X-ReviewRequest-Repository: mesos-git --===============4671630756290759253== MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit > On Oct. 10, 2014, 11:24 a.m., Vinod Kone wrote: > > src/tests/slave_tests.cpp, line 1027 > > > > > > why this change? More readable. But I'll revert it. > On Oct. 10, 2014, 11:24 a.m., Vinod Kone wrote: > > src/tests/slave_tests.cpp, line 1087 > > > > > > How can removeFramework() be called twice!!!???? Wouldn't that throw a CHECK failure? I thought that shutting down would call it again, but no. The debugger says gmock reacts to the Master::removeFramework call, not Slave::removeFramework, here. Will investigate. > On Oct. 10, 2014, 11:24 a.m., Vinod Kone wrote: > > src/slave/slave.cpp, line 1413 > > > > > > It is weird to me that you remove the task here but (potentially) remove the executor up in _runTask(). It's not obvious to me why you made that choice. If there is a good reason, please add a comment here. The task is removed in killTask(), because later on the information that it should be removed is no longer easily at hand. I looked at lines 1194, 1195 in _runTask(): Framework* framework = getFramework(frameworkId); CHECK_NOTNULL(framework); If I removed the framework earlier than this, this check would fire. If I want to avoid that, I need to write extra code (e.g. prevent _runTask() from happening). Assembling all framework removals in the second half of _runTask() looks like good defensive practice to me. There is no harm in removing the framework in _runTask() vs. killTask() since _runTask() must eventually happen. But by waiting with removing the framework until _runTask(), we do not need to think about what implications concurrent starting and killing of multiple tasks with the same executor might have. > On Oct. 10, 2014, 11:24 a.m., Vinod Kone wrote: > > src/slave/slave.cpp, lines 1212-1214 > > > > > > This seems out of place to me. See my comments below. See below. - Bernd ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/23912/#review56016 ----------------------------------------------------------- On Oct. 9, 2014, 7:10 a.m., Bernd Mathiske wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/23912/ > ----------------------------------------------------------- > > (Updated Oct. 9, 2014, 7:10 a.m.) > > > Review request for mesos. > > > Bugs: MESOS-947 > https://issues.apache.org/jira/browse/MESOS-947 > > > Repository: mesos-git > > > Description > ------- > > Fixes MESOS-947 "Slave should properly handle a killTask() that arrives between runTask() and _runTask()". > > Slave::killTask() did not check for task in question combination to be "pending" (i.e. Slave::runTask had happened, but Slave::_runTask had not yet) and then erroneously assumed that Slave::runTask() had not been executed. The task was then marked "LOST" instead of "KILLED". But Slave::runTask had already scheduled Slave::_runTask to follow. Now the entry for being "pending" is removed, and the task is marked "KILLED", and _runTask gets informed about this. It checks whether the task in question is currently "pending" and if it is not, then it infers that the task has been killed and does not erroneously try to complete launching it. > > > Diffs > ----- > > src/slave/slave.hpp 76d505c698774204b2536b66ea8a83a9a2a5e2c1 > src/slave/slave.cpp cb3759993f863590cb1545c73072feb0331aa6c9 > src/tests/mesos.hpp 957e2233cc11c438fd80d3b6d1907a1223093104 > src/tests/mesos.cpp 3dcb2acd5ad4ab5e3a7b4fe524ee077558112773 > src/tests/slave_tests.cpp 69be28f6e82b99e23424bd2be8294f715d8040d4 > > Diff: https://reviews.apache.org/r/23912/diff/ > > > Testing > ------- > > Wrote a unit test that reliably created the situation described in the ticket. Observed that TASK_LOST and the listed log output occurred. This pointed directly to the lines in killTask() where the problem is rooted. Ran the test after fixing, it succeeded. Checked the log. It looks like a "clean kill" now :-) > > > Thanks, > > Bernd Mathiske > > --===============4671630756290759253==--