mesos-issues mailing list archives

From "Meng Zhu (JIRA)" <j...@apache.org>
Subject [jira] [Created] (MESOS-8850) Race between master and allocator when destroying shared volume could lead to sorter check failure.
Date Thu, 26 Apr 2018 19:11:01 GMT
Meng Zhu created MESOS-8850:
-------------------------------

             Summary: Race between master and allocator when destroying shared volume could lead to sorter check failure.
                 Key: MESOS-8850
                 URL: https://issues.apache.org/jira/browse/MESOS-8850
             Project: Mesos
          Issue Type: Bug
          Components: allocation, master
            Reporter: Meng Zhu


When destroying a shared volume, the master first rescinds the offers that contain the shared
volume and then applies the destroy operation. This process involves interaction between the
master and allocator actors. The following race could arise:

1. Framework1 and framework2 are each offered a shared disk;
2. Framework2 asks the master to destroy the shared disk;
3. Master rescinds framework1's offer that contains the shared disk;
4. `allocator->recoverResources` is called to recover framework1’s offered resources
in the allocator;
5. [Race] Allocator immediately allocates resources to framework1. The allocation contains the
shared disk that was just recovered and has not yet been destroyed. Allocator
invokes `offerCallback`, which dispatches to the master;
6. Master continues the destroy operation and calls `allocator->updateAllocation` to notify
the allocator to transform the shared disk into a regular reserved disk;
7. Master processes the `offerCallback` dispatched in step 5 and offers the shared disk to
framework1.

At this point, the same disk resource appears in two different places: once as a shared disk
offered to framework1, and once as a non-shared disk currently held by framework2 (soon to be
recovered).

One consequence is the following:
Framework2’s resources, which include the (now regular reserved) disk resource, get recovered.
Later, when recovering framework1’s resources, which contain the shared disk, the sorter
finds that the allocated resources on the agent do not contain that shared disk. This is because
in step 5, when offering the shared disk, the allocator did not increase the total allocated
resources, as framework2 was also holding the shared disk. A shared resource is added to the
allocated resources only the first time it is allocated.

This will lead to a check failure in the sorter:
https://github.com/apache/mesos/blob/master/src/master/allocator/sorter/drf/sorter.cpp#L480
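
The sequence above can be replayed in a small toy model. This is only a sketch under stated assumptions, not Mesos code: `ToyAllocator`, `SorterCheckFailure`, `run_race`, and the resource labels are hypothetical names made up for illustration. The model assumes that the sorter tracks a single set of allocated resources per agent, that a shared resource enters that set only on its first allocation, and that `update_allocation` swaps the shared disk for the regular one in that set.

```python
from collections import deque

SHARED = "disk(shared)"      # hypothetical label for the shared volume
REGULAR = "disk(reserved)"   # hypothetical label for the disk after the destroy


class SorterCheckFailure(Exception):
    """Stands in for the check failure in the sorter (toy model only)."""


class ToyAllocator:
    def __init__(self):
        self.held = {}          # framework -> resources currently offered to it
        self.allocated = set()  # toy sorter view of allocated resources on the agent

    def _holders(self, resource):
        return [f for f, rs in self.held.items() if resource in rs]

    def allocate(self, framework, resource):
        # Assumption: a shared resource is counted only on its first allocation.
        if not self._holders(resource):
            self.allocated.add(resource)
        self.held.setdefault(framework, []).append(resource)

    def recover(self, framework, resource):
        # Assumption: the sorter checks that what it recovers is allocated.
        if resource not in self.allocated:
            raise SorterCheckFailure(f"{resource} not allocated on agent")
        self.held[framework].remove(resource)
        if not self._holders(resource):
            self.allocated.discard(resource)

    def update_allocation(self, framework, old, new):
        # Transform the shared disk into a regular reserved disk.
        self.held[framework].remove(old)
        self.held[framework].append(new)
        self.allocated.discard(old)
        self.allocated.add(new)


def run_race():
    allocator = ToyAllocator()
    master_inbox = deque()  # messages dispatched to the master, in order

    # Step 1: both frameworks are offered the shared disk.
    allocator.allocate("framework1", SHARED)
    allocator.allocate("framework2", SHARED)
    # Steps 2-4: framework2 asks for the destroy; framework1's offer is
    # rescinded and its resources are recovered in the allocator.
    allocator.recover("framework1", SHARED)
    # Step 5 (race): the allocator re-offers the not-yet-destroyed disk.
    allocator.allocate("framework1", SHARED)
    master_inbox.append(("framework1", SHARED))
    # Step 6: the master continues the destroy operation.
    allocator.update_allocation("framework2", SHARED, REGULAR)
    # Step 7: the master processes the queued offer callback.
    offered = master_inbox.popleft()

    # Aftermath: framework2's resources are recovered, then framework1's.
    allocator.recover("framework2", REGULAR)
    try:
        allocator.recover(*offered)
    except SorterCheckFailure as failure:
        return offered, str(failure)
    return offered, None
```

In this model, calling `run_race()` returns the stale offer together with the failure message from the final recovery. Dropping the step 5 lines makes the sequence complete without raising, which matches the description of step 5 as the racy interleaving.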

Moving offer management to the allocator would eliminate this race. Without that,
we will need to add extra synchronization.



