Date: Tue, 13 Jun 2017 18:35:00 +0000 (UTC)
From: "Zhitao Li (JIRA)"
To: issues@mesos.apache.org
Subject: [jira] [Commented] (MESOS-7639) Oversubscription could crash the master due to CHECK failure in the allocator

    [ https://issues.apache.org/jira/browse/MESOS-7639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16048218#comment-16048218 ]

Zhitao Li commented on MESOS-7639:
----------------------------------

So, I've created a test in https://github.com/zhitaoli/mesos/tree/zhitao/1.1.2/drf_sorter_crash_test which reliably crashes the master under the matching condition.
However, when applying the same test (with minimal modification) to the current master branch, it no longer crashes the master (the branch is at https://github.com/zhitaoli/mesos/tree/zhitao/public/revocable_drf_crash_test).

With a bit more log analysis, it looks like the change to {{HierarchicalAllocatorProcess::updateAllocation()}} in [r/55359 | https://reviews.apache.org/r/55359/diff/6#index_header] may have eliminated the crashing scenario: it now updates the {{frameworkSorter}} by {{offeredResources}} rather than {{frameworkAllocation}}, so the {{frameworkSorter}} stays over-allocated during the race until {{Master::_accept}} calls {{allocator->recoverResources}} on the offer, at which point the over-allocation is corrected.

[~bmahler] [~xujyan], can you please comment on whether my reading above is correct? If so, I suspect we no longer have a verified way to trigger a master crash due to over-allocation after the 1.2.0 release.


> Oversubscription could crash the master due to CHECK failure in the allocator
> -----------------------------------------------------------------------------
>
>                 Key: MESOS-7639
>                 URL: https://issues.apache.org/jira/browse/MESOS-7639
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Yan Xu
>
> As I described in MESOS-7566, the following scenario is possible when the agent sends updated oversubscribed resources to the master:
> - The agent's {{UpdateSlaveMessage}} reduces the oversubscribed resources.
> - {{Master::updateSlave}}, upon receiving the update, first calls {{HierarchicalAllocatorProcess::updateSlave}}, followed by {{allocator->recoverResources}}.
> - {{HierarchicalAllocatorProcess::updateSlave}} updates {{roleSorter.total_}} to reduce the total, so the total can drop below the allocation.
> - In the subsequent {{allocator->recoverResources}} call, the attempt to remove the outstanding allocation may fail to bring it back below the total, because some allocation may not be in outstanding offers: it could be in offered resources pending between {{Master::accept}} and {{Master::_accept}}. So the end result could still be {{total < allocation}}.
> - Then, when {{Master::_accept}} is executed, it calls {{allocator->updateAllocation}}, in which the {{total < allocation}} condition can trigger the crash.
> The gist is that there are resources that are neither in the master's {{offers}} nor tracked in the allocator when {{Master::updateSlave}} is called.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
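For readers following the scenario quoted above, here is a minimal standalone sketch of the total/allocation invariant and how the described ordering can violate it. This is not Mesos source; the struct, names, and numbers are hypothetical and chosen only to mirror the sequence of steps in the issue description.

{code}
// Hypothetical sketch (NOT Mesos code): an allocator-like sorter tracks a
// per-agent `total` and an `allocation`, and a CHECK-style invariant assumes
// allocation <= total. The ordering below mirrors the race in the issue.
#include <cassert>
#include <iostream>

struct Sorter {
  int total = 0;       // resources the agent is believed to have
  int allocation = 0;  // resources currently charged to frameworks
};

int main() {
  Sorter sorter;

  // The agent advertises 10 units of (oversubscribed) resources and all 10
  // are allocated: 6 sit in outstanding offers, 4 are "in flight" between
  // Master::accept and Master::_accept (hypothetical split).
  sorter.total = 10;
  sorter.allocation = 10;
  int outstandingOffers = 6;

  // Step 1: an UpdateSlaveMessage shrinks the oversubscribed resources, so
  // updateSlave() drops the total below the current allocation.
  sorter.total = 3;

  // Step 2: recoverResources() can only give back what is in outstanding
  // offers; the 4 in-flight units remain allocated.
  sorter.allocation -= outstandingOffers;   // allocation == 4 > total == 3

  // Step 3: the next allocation update re-validates the invariant. In the
  // real allocator this is a CHECK, which aborts the master process.
  std::cout << "total=" << sorter.total
            << " allocation=" << sorter.allocation << std::endl;
  assert(sorter.allocation <= sorter.total);  // fires: 4 > 3

  return 0;
}
{code}

Compiled as a plain C++ program (with assertions enabled), the final assert fires, which is the analogue of the {{total < allocation}} CHECK failure that crashes the master in the scenario above.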