Subject: Re: Review Request 25911: Changed master to free up resources for completed tasks when framework is disconnected.
From: "Niklas Nielsen"
To: "Ben Mahler"
Cc: "Niklas Nielsen", "Timothy Chen", "mesos"
Date: Tue, 23 Sep 2014 15:22:11 -0000
Message-ID: <20140923152211.14999.36949@reviews.apache.org>

> On Sept. 22, 2014, 5:42 p.m., Timothy Chen wrote:
> > src/master/master.cpp, line 4401
> >
> > It seems the recovered resources are only used internally by the master. Is there any reason for introducing a new protobuf field instead of just storing it locally?

Thanks for bringing this up. Ben M reached out on IRC, and I am working on a refactor which introduces a task struct that we can hang this off.

- Niklas


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25911/#review54221
-----------------------------------------------------------


On Sept. 22, 2014, 3:30 p.m., Niklas Nielsen wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/25911/
> -----------------------------------------------------------
>
> (Updated Sept. 22, 2014, 3:30 p.m.)
>
>
> Review request for mesos and Ben Mahler.
>
>
> Bugs: MESOS-1817
>     https://issues.apache.org/jira/browse/MESOS-1817
>
>
> Repository: mesos-git
>
>
> Description
> -------
>
> We have run into a problem that causes tasks which complete while a
> framework is disconnected (with a failover timeout set) to remain in a
> running state even though the tasks have actually finished. This hogs the
> cluster and gives users an inconsistent view of the cluster state.
>
> The problem turns out to be an issue with the ack-cycle of status
> updates: if the framework disconnects (with a failover timeout set), the
> status update manager on the slaves will keep trying to send the front of
> the status update stream to the master (which in turn forwards it to the
> framework). If the first status update after the disconnect is terminal,
> things work out fine; the master picks up the terminal state, removes
> the task, and releases the resources. If, on the other hand, a
> non-terminal status is ahead of it in the stream, the master will not
> learn that the task finished (or failed) until the framework reconnects.
>
> As a first pass, this patch makes the status update manager inform the
> master if a terminal state was found in the pending stream of a task.
> If so, the master will recover the resources but will still wait for the
> updates to arrive before updating the task state and statuses.
>
>
> Diffs
> -----
>
>   src/master/master.hpp f5d74ae
>   src/master/master.cpp e5d30e9
>   src/messages/messages.proto 7cb3ce6
>   src/slave/status_update_manager.hpp 24e3882
>   src/slave/status_update_manager.cpp 5d5cf23
>   src/tests/fault_tolerance_tests.cpp 1543860
>
> Diff: https://reviews.apache.org/r/25911/diff/
>
>
> Testing
> -------
>
> Added a new test: FaultToleranceTest.RecoverResourcesDuringSchedulerDisconnect, which exercises the new code path.
>
> make check
>
>
> Thanks,
>
> Niklas Nielsen
>
>
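
For readers following the ack-cycle issue in the description, here is a rough sketch of the idea. It is not the code from the patch: `TaskState`, `isTerminal`, and `PendingStream` are hypothetical names made up for illustration, and the real status update manager tracks far more than this. It only models a per-task queue of unacknowledged updates in which just the front is sent, plus a scan for a terminal state anywhere in the queue.

```cpp
#include <algorithm>
#include <cassert>
#include <deque>

// Hypothetical, simplified task states; not the Mesos protobuf enum.
enum class TaskState { STAGING, RUNNING, FINISHED, FAILED, KILLED, LOST };

// A state is terminal once the task can make no further progress.
inline bool isTerminal(TaskState state) {
  return state == TaskState::FINISHED || state == TaskState::FAILED ||
         state == TaskState::KILLED || state == TaskState::LOST;
}

// Hypothetical model of one task's pending (unacknowledged) status updates.
struct PendingStream {
  std::deque<TaskState> pending;

  // Only the front update is (re)sent until it is acknowledged.
  TaskState front() const {
    assert(!pending.empty());
    return pending.front();
  }

  // An acknowledgement drops the front and exposes the next update.
  void acknowledge() {
    assert(!pending.empty());
    pending.pop_front();
  }

  // The check sketched here: is there a terminal update anywhere in the
  // pending stream, even if it is stuck behind a non-terminal one?
  bool hasPendingTerminal() const {
    return std::any_of(pending.begin(), pending.end(), isTerminal);
  }
};
```

With a stream holding RUNNING followed by FINISHED and no acknowledgements arriving, `front()` stays at RUNNING, so a master that only looks at forwarded updates never sees the terminal state. `hasPendingTerminal()` returns true in that situation, which is the kind of signal the description says the master would use to recover resources early.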