mesos-issues mailing list archives

From "Zhitao Li (JIRA)" <j...@apache.org>
Subject [jira] [Created] (MESOS-7566) Master crash due to failed check in DRFSorter::remove
Date Thu, 25 May 2017 21:38:04 GMT
Zhitao Li created MESOS-7566:
--------------------------------

             Summary: Master crash due to failed check in DRFSorter::remove
                 Key: MESOS-7566
                 URL: https://issues.apache.org/jira/browse/MESOS-7566
             Project: Mesos
          Issue Type: Bug
    Affects Versions: 1.1.1, 1.1.2
            Reporter: Zhitao Li
            Priority: Critical


A check in DRFSorter::remove (https://github.com/apache/mesos/blob/1.1.2/src/master/allocator/sorter/drf/sorter.cpp#L355)
is triggered occasionally in our cluster and crashes the master leader.

I manually modified that check to print the related variables; the master log
linked below shows the output.

https://gist.github.com/zhitaoli/0662d9fe1f6d57de344951c05b536bad#file-gistfile1-txt

From the log, it seems the check was using a stale value of {{cpus(*){REV}:26}}
after the value had been updated to {{cpus(*){REV}:25}}, so the check failed and the master crashed.
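
To illustrate the suspected failure mode, here is a minimal sketch. It is not the Mesos code: ToySorter, its methods, the agent name "agent1", and the helper staleRemovalRejected are all hypothetical, and resources are reduced to a name plus a scalar. It only shows how a containment check before a subtraction fails when the caller still holds the old quantity (26) after the tracked total has dropped to 25.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a sorter's per-agent bookkeeping; not Mesos code.
class ToySorter {
public:
  // Record `amount` units of resource `name` on `agent`.
  void add(const std::string& agent, const std::string& name, double amount) {
    total_[agent][name] += amount;
  }

  // True if the tracked total on `agent` covers `amount` of `name`.
  bool contains(const std::string& agent, const std::string& name,
                double amount) const {
    auto a = total_.find(agent);
    if (a == total_.end()) return false;
    auto r = a->second.find(name);
    return r != a->second.end() && r->second >= amount;
  }

  // Subtract only when covered; returns false where a CHECK would abort.
  bool remove(const std::string& agent, const std::string& name,
              double amount) {
    if (!contains(agent, name, amount)) return false;
    total_[agent][name] -= amount;
    return true;
  }

  // Mirrors a CHECK-style assertion: aborts if the removal is not covered.
  void checkedRemove(const std::string& agent, const std::string& name,
                     double amount) {
    bool ok = remove(agent, name, amount);
    assert(ok);
    (void)ok;  // silence unused-variable warnings when NDEBUG is defined
  }

private:
  std::map<std::string, std::map<std::string, double>> total_;
};

// Replays the suspected sequence with made-up bookkeeping.
bool staleRemovalRejected() {
  ToySorter sorter;
  // The tracked total already reflects the updated value.
  sorter.add("agent1", "cpus(*){REV}", 25);
  // The caller still holds the stale quantity, so the removal is not covered.
  return !sorter.remove("agent1", "cpus(*){REV}", 26);
}
```

In this sketch, calling checkedRemove with the stale quantity would abort the process, which is the kind of crash described above; whether the real bookkeeping goes stale in this way is still an assumption.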

So far, both verified occurrences of this bug were observed near an {{UNRESERVE}} operation
(see the preceding lines in the log).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
