Mailing-List: contact dev-help@mesos.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@mesos.apache.org
Content-Type: multipart/alternative;
 boundary="===============4523066091682756192=="
MIME-Version: 1.0
Subject: Re: Review Request 25911: Changed master to free up resources for
 completed tasks when framework is disconnected.
From: "Niklas Nielsen" <nik@qni.dk>
To: "Ben Mahler" <benjamin.mahler@gmail.com>
Cc: "Niklas Nielsen" <nik@qni.dk>, "Timothy Chen" <tnachen@apache.org>,
 "mesos" <dev@mesos.apache.org>
Date: Tue, 23 Sep 2014 15:22:11 -0000
Message-ID: <20140923152211.14999.36949@reviews.apache.org>
Auto-Submitted: auto-generated
Sender: "Niklas Nielsen" <noreply@reviews.apache.org>
References: <20140923004247.15006.48788@reviews.apache.org>
In-Reply-To: <20140923004247.15006.48788@reviews.apache.org>
Reply-To: "Niklas Nielsen" <nik@qni.dk>

--===============4523066091682756192==
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit


> On Sept. 22, 2014, 5:42 p.m., Timothy Chen wrote:
> > src/master/master.cpp, line 4401
> > <https://reviews.apache.org/r/25911/diff/1/?file=700520#file700520line4401>
> >
> >     It seems that resources recovered is only used internally by the master. Is there any reason for introducing a new protobuf field instead of just storing it locally?

Thanks for bringing this up. Ben M reached out on IRC, and I am working on a refactor that introduces a task struct which we can hang this off of.


- Niklas


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25911/#review54221
-----------------------------------------------------------


On Sept. 22, 2014, 3:30 p.m., Niklas Nielsen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/25911/
> -----------------------------------------------------------
> 
> (Updated Sept. 22, 2014, 3:30 p.m.)
> 
> 
> Review request for mesos and Ben Mahler.
> 
> 
> Bugs: MESOS-1817
>     https://issues.apache.org/jira/browse/MESOS-1817
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> We have run into a problem that causes tasks which complete while a
> framework is disconnected (with a failover timeout set) to remain in a
> running state even though the tasks have actually finished. This hogs
> the cluster and gives users an inconsistent view of the cluster state.
> 
> The problem turns out to be an issue with the ack-cycle of status
> updates: If the framework disconnects (with a failover timeout set), the
> status update manager on the slaves will keep trying to send the front of
> the status update stream to the master (which in turn forwards it to the
> framework). If the first status update after the disconnect is terminal,
> things work out fine; the master picks the terminal state up, removes
> the task and releases the resources. If, on the other hand, a
> non-terminal status update sits ahead of the terminal one in the stream,
> the master will not learn that the task finished (or failed) until the
> framework reconnects.
> 
> As a first pass, this patch makes the status update manager inform the
> master if a terminal state is found in the pending stream of a task.
> If so, the master will recover the resources but will still wait for the
> updates to arrive before updating the task state and statuses.
> 
> 
> Diffs
> -----
> 
>   src/master/master.hpp f5d74ae 
>   src/master/master.cpp e5d30e9 
>   src/messages/messages.proto 7cb3ce6 
>   src/slave/status_update_manager.hpp 24e3882 
>   src/slave/status_update_manager.cpp 5d5cf23 
>   src/tests/fault_tolerance_tests.cpp 1543860 
> 
> Diff: https://reviews.apache.org/r/25911/diff/
> 
> 
> Testing
> -------
> 
> Added a new test: FaultToleranceTest.RecoverResourcesDuringSchedulerDisconnect, which exercises the new code path.
> 
> make check
> 
> 
> Thanks,
> 
> Niklas Nielsen
> 
>
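
For anyone following along, the stuck-stream case from the description can be sketched with a toy model. This is only an illustration under assumptions, not the patch itself: StatusUpdateStream, master_can_recover_resources and use_hint are hypothetical names made up for this sketch, and the real logic lives in the C++ status update manager and master.

```python
from collections import deque

# Terminal task states, used here as plain strings for illustration only.
TERMINAL_STATES = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}


class StatusUpdateStream:
    """Toy model of one task's pending status updates on a slave (hypothetical)."""

    def __init__(self):
        self.pending = deque()

    def enqueue(self, state):
        self.pending.append(state)

    def front(self):
        # Only the front of the stream is (re)sent until it is acknowledged.
        return self.pending[0] if self.pending else None

    def acknowledge(self):
        # Drops the front update once the framework has acknowledged it.
        self.pending.popleft()

    def has_pending_terminal(self):
        # The hint described in the review: is any pending update terminal?
        return any(state in TERMINAL_STATES for state in self.pending)


def master_can_recover_resources(stream, use_hint):
    """Return True if the toy master would free the task's resources."""
    if stream.front() in TERMINAL_STATES:
        return True
    return use_hint and stream.has_pending_terminal()


stream = StatusUpdateStream()
stream.enqueue("TASK_RUNNING")   # non-terminal update, never acknowledged
stream.enqueue("TASK_FINISHED")  # terminal update stuck behind it

# Without the hint the master only ever sees TASK_RUNNING.
blocked = master_can_recover_resources(stream, use_hint=False)
# With the hint the master can recover resources before the ack arrives.
unblocked = master_can_recover_resources(stream, use_hint=True)
```

While the framework is disconnected, nothing calls acknowledge(), so the front of the stream never advances. That is why the hint is needed for the master to free resources early.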


--===============4523066091682756192==--