From: Jie Yu
Date: Mon, 11 Dec 2017 12:14:42 -0800
Subject: Re: mesos git commit: Added a performance working group December 2017 blog post.
To: dev
Reply-To: dev@mesos.apache.org
In-Reply-To: <5a51c1a6a62946ad9eef4e92241deeb2@git.apache.org>
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="001a114b7de85608b70560162f84"

--001a114b7de85608b70560162f84
Content-Type: text/plain; charset="UTF-8"

This is AWESOME!

- Jie

On Mon, Dec 11, 2017 at 11:54 AM, wrote:

> Repository: mesos
> Updated Branches:
>   refs/heads/master 83f81b7b2 -> e7244ae1e
>
>
> Added a performance working group December 2017 blog post.
>
> This blog post discusses the master failover performance improvements
> that were made in the past few months.
>
>
> Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
> Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/e7244ae1
> Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/e7244ae1
> Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/e7244ae1
>
> Branch: refs/heads/master
> Commit: e7244ae1eb84a8bfcbe2940107c7f97a53832cf2
> Parents: 83f81b7
> Author: Benjamin Mahler
> Authored: Mon Dec 11 11:52:43 2017 -0800
> Committer: Benjamin Mahler
> Committed: Mon Dec 11 11:52:43 2017 -0800
>
> ----------------------------------------------------------------------
>  .../1.3-1.5_master_failover_no_history.png      | Bin 0 -> 115884 bytes
>  .../1.3-1.5_master_failover_with_history.png    | Bin 0 -> 140716 bytes
>  ...performance-working-group-progress-report.md | 90 +++++++++++++++++++
>  3 files changed, 90 insertions(+)
> ----------------------------------------------------------------------
>
>
> http://git-wip-us.apache.org/repos/asf/mesos/blob/e7244ae1/docs/images/1.3-1.5_master_failover_no_history.png
> ----------------------------------------------------------------------
> diff --git a/docs/images/1.3-1.5_master_failover_no_history.png b/docs/images/1.3-1.5_master_failover_no_history.png
> new file mode 100644
> index 0000000..c1dca64
> Binary files /dev/null and b/docs/images/1.3-1.5_master_failover_no_history.png differ
>
> http://git-wip-us.apache.org/repos/asf/mesos/blob/e7244ae1/docs/images/1.3-1.5_master_failover_with_history.png
> ----------------------------------------------------------------------
> diff --git a/docs/images/1.3-1.5_master_failover_with_history.png b/docs/images/1.3-1.5_master_failover_with_history.png
> new file mode 100644
> index 0000000..4f3deec
> Binary files /dev/null and b/docs/images/1.3-1.5_master_failover_with_history.png differ
>
> http://git-wip-us.apache.org/repos/asf/mesos/blob/e7244ae1/site/source/blog/2017-12-7-performance-working-group-progress-report.md
> ----------------------------------------------------------------------
> diff --git a/site/source/blog/2017-12-7-performance-working-group-progress-report.md b/site/source/blog/2017-12-7-performance-working-group-progress-report.md
> new file mode 100644
> index 0000000..7e42642
> --- /dev/null
> +++ b/site/source/blog/2017-12-7-performance-working-group-progress-report.md
> @@ -0,0 +1,90 @@
> +---
> +layout: post
> +title: December 2017 Performance Working Group Progress Report
> +published: true
> +post_author:
> +  display_name: Benjamin Mahler
> +  gravatar: fb43656d4d45f940160c3226c53309f5
> +  twitter: bmahler
> +tags: Performance
> +---
> +
> +**Scalability and performance are key features for Mesos. Some users of Mesos already run production clusters that consist of more than 35,000 agents and 100,000 active tasks.** However, there remains a lot of room for improvement across a variety of areas of the system.
> +
> +The performance working group was created in order to focus on some of these areas. The group's charter is to improve scalability / throughput / latency across the system. To measure our improvements and prevent performance regressions, we will write benchmarks and automate them.
> +
> +In the past few months, we've focused on making improvements to the following areas:
> +
> +* **Master failover time-to-completion**: Achieved a 450-600% improvement in throughput, which reduces the time-to-completion by 80-85%.
> +* **[Libprocess](https://github.com/apache/mesos/tree/master/3rdparty/libprocess) message passing throughput**: These improvements will be covered in a separate blog post.
> +
> +Before we dive into the master failover improvements, I would like to recognize and thank the following contributors:
> +
> +* **Dmitry Zhuk**: for writing *a lot* of patches for improving the master failover performance.
> +* **Michael Park**: for reviewing and shipping many of Dmitry's more challenging patches.
> +* **Yan Xu**: for writing the master failover benchmark that was the basis for measuring the improvements.
> +
> +## Master Failover Time-To-Completion
> +
> +Our first area of focus was to improve the time it takes for a master failover to complete, where completion is defined as all of the agents successfully re-registering. Mesos is architected to use a centralized master with standby masters that participate in a quorum for high availability. For scalability reasons, the leading master stores the state of the cluster in memory. During a master failover, the leading master therefore needs to rebuild the in-memory state from all of the agents that re-register. During this time, the master is available to process other requests, but will expose only partial state to API consumers.
> +
> +Rebuilding the master’s in-memory state can be expensive for larger clusters, and so the focus of this effort was to make it more efficient. Improvements were made in several areas, and only the highest-impact changes are listed below:
> +
> +### Protobuf 3.5.0 Move Support
> +
> +We upgraded to protobuf 3.5.0 in order to gain move support. When we profiled the master, we found that it spent a lot of time copying protobuf messages during agent re-registration. This support allowed us to eliminate copies of protobuf messages while retaining value semantics.
> +
> +### Move Support and Copy Elimination in Libprocess `dispatch` / `defer` / `install`
> +
> +Libprocess provides several primitives for message passing:
> +
> +* `dispatch`: Provides the ability to post a message to a local `Process`.
> +* `defer`: Provides a deferred `dispatch`, i.e. a function object that when invoked will issue a `dispatch`.
> +* `install`: Installs a handler for receiving a protobuf message.
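
A minimal sketch of how the `dispatch` and `defer` primitives listed above relate to each other, assuming a single-threaded toy mailbox. `ToyProcess`, `toy_dispatch`, `toy_defer`, and `dispatch_example` are hypothetical names made up for illustration; this is not the libprocess API or its implementation.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy actor: calls are queued in a mailbox and run later.
class ToyProcess {
public:
  void enqueue(std::function<void()> call) { mailbox_.push(std::move(call)); }

  // Drain the mailbox, running each queued call in FIFO order.
  void run() {
    while (!mailbox_.empty()) {
      std::function<void()> call = std::move(mailbox_.front());
      mailbox_.pop();
      call();
    }
  }

  std::vector<std::string> received;  // Messages handled so far.

private:
  std::queue<std::function<void()>> mailbox_;
};

// Toy `dispatch`: post a call to the process instead of running it inline.
// The message is moved into the queued call and moved again when handled.
void toy_dispatch(ToyProcess& process, std::string message) {
  process.enqueue([&process, message = std::move(message)]() mutable {
    process.received.push_back(std::move(message));
  });
}

// Toy `defer`: return a function object that issues a dispatch when invoked.
std::function<void(std::string)> toy_defer(ToyProcess& process) {
  return [&process](std::string message) {
    toy_dispatch(process, std::move(message));
  };
}

// Example usage: nothing is handled until the process drains its mailbox.
void dispatch_example() {
  ToyProcess process;
  toy_dispatch(process, "first");
  auto deferred = toy_defer(process);
  deferred("second");                // Only enqueues; does not run the handler.
  assert(process.received.empty());
  process.run();
  assert(process.received.size() == 2);
}
```

The sketch passes the message by value and moves it at each hop, which is the general shape of the copy elimination this section is about; how libprocess itself achieves it is described in the quoted post, not here.
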
> +
> +These primitives did not have move support, as they were originally added prior to the addition of C++11 support to the code-base. In order to eliminate copies, we enhanced these primitives to support moving arguments in and out.
> +
> +This required introducing a new C++ utility, because `defer` takes on the same API as `std::bind` (e.g., placeholders). Specifically, the function object returned by `std::bind` does not move the bound arguments into the stored callable. In order to enable this, `defer` now uses a utility we introduced called `lambda::partial` rather than `std::bind`. `lambda::partial` performs partial function application similar to `std::bind` except the returned function object moves the bound arguments into the stored callable if the invocation is performed on an r-value function object.
> +
> +### Copy Elimination in the Master
> +
> +With these previous enhancements in place, we were able to eliminate many of the expensive copies of protobuf messages performed by the master.
> +
> +### Benchmark and Results
> +
> +We wrote a synthetic benchmark to simulate a master failover. This benchmark prepares all the messages that would be sent to the master by the agents that need to re-register:
> +
> +* The benchmark uses synthetic agents in that they are just an actor that knows how to re-register with the master.
> +* Each "agent" will send a configurable number of active and completed tasks belonging to a configurable number of active and completed frameworks.
> +* Each task has 10 small labels to introduce metadata overhead.
> +
> +The benchmark has a few caveats:
> +
> +* It does not use executors (this would show improved results over what is shown below, but for simplicity the benchmark omits them)
> +* It uses local message passing, whereas a real cluster would be passing messages over HTTP.
> +* It uses a quorum size of 1, so writes to the master’s registry occur only on a single local log replica.
> +* The synthetic agents do not retry their re-registration, whereas typically agents will retry with a backoff.
> +
> +This was tested on a 2015 MacBook Pro with a 2.8 GHz Intel Core i7 processor. Mesos was configured using: `Apple LLVM version 9.0.0 (clang-900.0.38)`, with `-O2` enabled in 1.5.0.
> +
> +The first results represent a cluster with 10 active tasks per agent across 5 frameworks, with no completed tasks. The results from 1,000 - 40,000 agents with 10,000 - 400,000 active tasks are shown below:
> +
> +![1.3 - 1.5 Master Failover without Task History Graph](/assets/img/documentation/1.3-1.5_master_failover_no_history.png)
> +
> +There was a reduction in the time-to-completion of ~80% due to a 450-500% improvement in throughput across 1.3.0 to 1.5.0.
> +
> +The second results add task history: each agent also now contains 100 completed tasks across 5 completed frameworks. The results from 1,000 - 40,000 agents with 10,000 - 400,000 active tasks and 100,000 - 4,000,000 completed tasks are shown below:
> +
> +![1.3 - 1.5 Master Failover with Task History Graph](/assets/img/documentation/1.3-1.5_master_failover_with_history.png)
> +
> +This represents a reduction in time-to-completion of ~85% due to a 550-700% improvement in throughput across 1.3.0 to 1.5.0.
> +
> +## Performance Working Group Roadmap
> +
> +We're currently targeting the following areas for improvements:
> +
> +* **Performance of the v1 API**: Currently the v1 API can be significantly slower than the v0 API. We would like to reach parity, and ideally surpass the performance of the v0 API.
> +  * **[Libprocess](https://github.com/apache/mesos/tree/master/3rdparty/libprocess) HTTP performance**: This will be undertaken as part of improving the v1 API performance, since it is HTTP-based.
> +* **Master state API performance**: Currently, API queries of the master's state are serviced by the same master actor that processes all of the messages from schedulers and agents. Since the query processing can block the master from processing other events, users need to be careful not to query the master excessively. In practice, the master gets queried quite heavily due to the presence of several tools that rely on the master's state (e.g. DNS tooling, UIs, CLIs, etc.) and so this is a critical problem for users. This effort will leverage the state streaming API to stream the state to a different actor that can serve the state API requests. This will ensure that expensive state queries do not affect the master's ability to process events.
> +
> +If you are a user and would like to suggest some areas for performance improvement, please let us know by emailing .
>
>

--001a114b7de85608b70560162f84--
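
Returning to the `std::bind` versus `lambda::partial` point in the quoted post, here is a minimal sketch of the idea, assuming a one-argument binder with ref-qualified call operators. `Tracked`, `ToyPartial`, `toy_partial`, `consume`, `Counts`, and the helper functions are hypothetical names for illustration; this is not the Mesos `lambda::partial` code, which also has to handle placeholders and multiple arguments.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <type_traits>
#include <utility>

// Payload that counts how often it is copied or moved.
struct Tracked {
  static int copies;
  static int moves;
  std::string payload;

  explicit Tracked(std::string p) : payload(std::move(p)) {}
  Tracked(const Tracked& other) : payload(other.payload) { ++copies; }
  Tracked(Tracked&& other) noexcept : payload(std::move(other.payload)) { ++moves; }
};

int Tracked::copies = 0;
int Tracked::moves = 0;

// Takes its argument by value, so each call either copies or moves into it.
int consume(Tracked t) { return static_cast<int>(t.payload.size()); }

// Hypothetical one-argument binder (an assumption, not the Mesos utility).
template <typename F, typename Arg>
class ToyPartial {
public:
  ToyPartial(F f, Arg arg) : f_(std::move(f)), arg_(std::move(arg)) {}

  // l-value invocation: the binder may be called again, so pass a reference.
  decltype(auto) operator()() & { return f_(arg_); }

  // r-value invocation: the binder is expiring, so move the bound argument out.
  decltype(auto) operator()() && { return std::move(f_)(std::move(arg_)); }

private:
  F f_;
  Arg arg_;
};

template <typename F, typename Arg>
ToyPartial<std::decay_t<F>, std::decay_t<Arg>> toy_partial(F&& f, Arg&& arg) {
  return ToyPartial<std::decay_t<F>, std::decay_t<Arg>>(
      std::forward<F>(f), std::forward<Arg>(arg));
}

struct Counts { int copies; int moves; };

// Invoke a `std::bind` result as an r-value; report copies/moves of the call.
Counts call_bind_as_rvalue() {
  auto bound = std::bind(consume, Tracked("re-registration message"));
  Tracked::copies = 0;
  Tracked::moves = 0;
  std::move(bound)();  // Bound arguments are still passed as l-values.
  return Counts{Tracked::copies, Tracked::moves};
}

// Invoke the toy binder as an r-value; report copies/moves of the call.
Counts call_partial_as_rvalue() {
  auto bound = toy_partial(consume, Tracked("re-registration message"));
  Tracked::copies = 0;
  Tracked::moves = 0;
  std::move(bound)();  // Selects the && overload, which moves the argument.
  return Counts{Tracked::copies, Tracked::moves};
}

// Example usage of the two helpers above.
void partial_example() {
  assert(call_bind_as_rvalue().copies == 1);
  assert(call_partial_as_rvalue().copies == 0);
}
```

In this sketch the `std::bind` call copies the bound `Tracked` even when the function object is itself an r-value, while the toy binder moves it. A real message payload would be a protobuf rather than a string wrapper, so the numbers here only illustrate the mechanism, not the benchmark results.
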