Date: Fri, 19 Jun 2015 23:12:01 +0000 (UTC)
From: "Jie Yu (JIRA)"
To: issues@mesos.apache.org
Subject: [jira] [Resolved] (MESOS-2891) Performance regression in hierarchical allocator.

[ https://issues.apache.org/jira/browse/MESOS-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jie Yu resolved MESOS-2891.
---------------------------
        Resolution: Fixed
     Fix Version/s: 0.23.0

> Performance regression in hierarchical allocator.
> -------------------------------------------------
>
>                 Key: MESOS-2891
>                 URL: https://issues.apache.org/jira/browse/MESOS-2891
>             Project: Mesos
>          Issue Type: Bug
>          Components: allocation, master
>            Reporter: Benjamin Mahler
>            Assignee: Jie Yu
>            Priority: Blocker
>              Labels: twitter
>             Fix For: 0.23.0
>
>         Attachments: Screen Shot 2015-06-18 at 5.02.26 PM.png, perf-kernel.svg
>
>
> For large clusters, the 0.23.0 allocator cannot keep up with the volume of slaves. After the following slave was re-registered, it took the allocator a long time to work through the backlog of slaves to add:
> {noformat:title=45 minute delay}
> I0618 18:55:40.738399 10172 master.cpp:3419] Re-registered slave 20150422-211121-2148346890-5050-3253-S4695
> I0618 19:40:14.960636 10164 hierarchical.hpp:496] Added slave 20150422-211121-2148346890-5050-3253-S4695
> {noformat}
> Empirically, [addSlave|https://github.com/apache/mesos/blob/dda49e688c7ece603ac7a04a977fc7085c713dd1/src/master/allocator/mesos/hierarchical.hpp#L462] and [updateSlave|https://github.com/apache/mesos/blob/dda49e688c7ece603ac7a04a977fc7085c713dd1/src/master/allocator/mesos/hierarchical.hpp#L533] have become expensive.
> Timings from a production cluster reveal that the allocator is spending in the low tens of milliseconds on each call to {{addSlave}} and {{updateSlave}}; with tens of thousands of slaves, this amounts to the large delay seen above.
> We also saw a slow, steady increase in memory consumption, further hinting at a queue backup in the allocator.
> A synthetic benchmark, like the one we did for the registrar, would be prudent here, along with visibility into the allocator's queue size.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)