Mailing-List: contact dev-help@mesos.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@mesos.apache.org
Date: Wed, 2 Oct 2013 00:41:24 +0000 (UTC)
From: "Benjamin Mahler (JIRA)" <jira@apache.org>
To: dev@mesos.apache.org
Message-ID: <JIRA.12671460.1380582290526.14937.1380674484194@arcas>
In-Reply-To: <JIRA.12671460.1380582290526@arcas>
References: <JIRA.12671460.1380582290526@arcas>
Subject: [jira] [Commented] (MESOS-711) Master::reconcile incorrectly
 recovers resources from reconciled tasks.
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/MESOS-711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13783531#comment-13783531 ] 

Benjamin Mahler commented on MESOS-711:
---------------------------------------

https://reviews.apache.org/r/14435/
https://reviews.apache.org/r/14436/

> Master::reconcile incorrectly recovers resources from reconciled tasks.
> -----------------------------------------------------------------------
>
>                 Key: MESOS-711
>                 URL: https://issues.apache.org/jira/browse/MESOS-711
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.14.0
>            Reporter: Benjamin Mahler
>            Assignee: Benjamin Mahler
>            Priority: Critical
>             Fix For: 0.15.0, 0.14.1
>
>
> The following sequence of events will over-subscribe a slave in the allocator:
> --> Slave re-registers with the same master due to a slave restart. Tasks were running on the slave, but are lost in the process of the slave restarting.
> --> As a result, the slave includes no task / executor information in its re-registration message.
> --> The slave is added back to the allocator with its full resources, in Master::reregisterSlave():
>       // If this is a disconnected slave, add it back to the allocator.
>       if (slave->disconnected) {
>         slave->disconnected = false; // Reset the flag.
>         hashmap<FrameworkID, Resources> resources;
>         foreach (const ExecutorInfo& executorInfo, executorInfos) {
>           resources[executorInfo.framework_id()] += executorInfo.resources();
>         }
>         foreach (const Task& task, tasks) {
>           // Ignore tasks that have reached terminal state.
>           if (!protobuf::isTerminalState(task.state())) {
>             resources[task.framework_id()] += task.resources();
>           }
>         }
>         allocator->slaveAdded(slaveId, slaveInfo, resources);
>       }
> --> Now reconciliation occurs, and the master sends a TASK_LOST message for each lost task through Master::statusUpdate, which results in a call to Allocator::resourcesRecovered!
> --> Reconciliation also calls Allocator::resourcesRecovered for the unknown executors.
> --> These two bugs result in the allocator offering more resources than the slave contains.
> We can either change the re-registration code, or change the reconciliation code. The easiest fix here is to add the slave back taking into account the used resources from the slave *and the master's* information.
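
The accounting problem described above, and the proposed fix, can be sketched with a small model. This is only an illustration under stated assumptions, not the Mesos code or the change in the linked reviews: ToyAllocator, usedCpus and availableAfterReconcile are hypothetical names, only CPUs are tracked, and the figures (a 4-CPU slave, one 2-CPU task) are made up for the example.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for the allocator; tracks CPUs for a single slave.
struct ToyAllocator {
  double total = 0;
  double available = 0;

  // Mirrors the idea of slaveAdded(): the slave comes back with
  // (total - used) resources available for offers.
  void slaveAdded(double totalCpus, double usedCpus) {
    total = totalCpus;
    available = totalCpus - usedCpus;
  }

  // Mirrors the idea of resourcesRecovered(): resources return to the pool.
  void resourcesRecovered(double cpus) { available += cpus; }
};

// Task id -> CPUs held by that task (assumed representation).
using TaskCpus = std::map<std::string, double>;

// Used CPUs when the slave is re-added. With includeMaster == false only
// the slave's re-registration message is counted (the behaviour described
// in the issue); with true, tasks known only to the master count as well.
double usedCpus(const TaskCpus& slaveTasks,
                const TaskCpus& masterTasks,
                bool includeMaster) {
  TaskCpus merged = slaveTasks;
  if (includeMaster) {
    // insert() keeps existing keys, so shared tasks are counted once.
    merged.insert(masterTasks.begin(), masterTasks.end());
  }
  double used = 0;
  for (const auto& entry : merged) {
    used += entry.second;
  }
  return used;
}

// Runs re-registration followed by reconciliation and returns the CPUs
// the allocator believes are available afterwards.
double availableAfterReconcile(bool includeMaster) {
  const double slaveTotal = 4.0;
  const TaskCpus masterTasks = {{"task-1", 2.0}};  // master still tracks it
  const TaskCpus slaveTasks = {};                  // lost in the restart

  ToyAllocator allocator;
  allocator.slaveAdded(
      slaveTotal, usedCpus(slaveTasks, masterTasks, includeMaster));

  // Reconciliation: each task the master knows about but the slave did
  // not report is treated as lost, and its resources are recovered.
  for (const auto& entry : masterTasks) {
    if (slaveTasks.count(entry.first) == 0) {
      allocator.resourcesRecovered(entry.second);
    }
  }
  return allocator.available;
}
```

In this model, availableAfterReconcile(false) yields 6 CPUs on a 4-CPU slave, which is the over-subscription, while availableAfterReconcile(true) yields 4, because the lost task's resources were subtracted before reconciliation handed them back.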

