Subject: Re: Review Request 58259: Add update affinity to Scheduler
From: David McLaughlin
To: Santhosh Kumar Shanmugham, Stephan Erb, Zameer Manji
Cc: Aurora, Mehrdad Nurolahzade, Aurora ReviewBot, David McLaughlin
Date: Tue, 02 May 2017 01:28:28 -0000
Message-ID: <20170502012828.62744.99268@reviews-vm2.apache.org>
X-ReviewRequest-URL: https://reviews.apache.org/r/58259/
> On April 25, 2017, 3:01 a.m., David McLaughlin wrote:
> > We have completed initial scale testing of this patch, with updates spanning 10 to 10k instances across 10k agents. Here are the findings:
> >
> > 1) The patch works great for small and medium-sized updates.
> > 2) For large updates, things start out with significant performance improvements but eventually degrade: cache hits fall to almost 0%, at which point performance falls back to what we see on master.
> > 3) Initially we believed the offers were taking too long due to compaction, but the overhead there turned out to be only a couple of seconds.
> > 4) We believe we have root-caused the degrading cache hits to interference from the task history pruner.
> > 5) Extending the timeout to 2 minutes doesn't seem to help either; the performance degradation due to (4) is quite severe.
> >
> > See attached screenshots.
> >
> > Anecdotally, this explains an issue we've frequently witnessed, where updates to extremely large services (5-8k instances) caused cluster-wide slowdowns even when capacity was readily available.
> >
> > Next steps are to confirm and address the task history pruning issue.

Another update: after a lot of testing, we tracked this down to the scheduling penalty in TaskGroups. Unfortunately, there is a bug in the penalty metric calculation (the counter isn't incremented when no tasks in a batch manage to be scheduled), which meant we falsely ruled this out early on. After ruling out GC and the async workers, we revisited the metric calculation and discovered the bug. From there, we were able to tune various settings to improve cache hit performance. But there are still sometimes cases where the cache hit percentage degrades to 0 and stays there for large updates.

Tuning is complicated because you have to weigh update batch size against the number of concurrent updates, max schedule attempts, and tasks per group (and really every other setting in SchedulingModule). On top of all of this, you also need to tune carefully so that your chronically failing and permanently pending tasks don't adversely affect you. The goal is to make sure the tasks waiting for reservations to be freed up aren't punished too heavily, without also repeating work for bad actors.

Probably the worst property is that once you start getting cache misses, it's very hard to recover: a cache miss falls back to the regular scheduling algorithm, which can also fail to find matching offers, and this only adds to the delay. We could probably avoid most of these issues if we could somehow feed the killing of tasks for updates into the current scheduling throughput... but that would require a huge refactor.

Currently we sustain a 100% cache hit rate with a high number of concurrent updates (~1k+ instances updated per minute) by lowering the worst-case scheduling penalty and increasing the number of tasks considered per job. It's also worth noting that we would have seen the same behavior with dynamic reservations that had 1-minute timeouts.
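To make the failure mode concrete, here is a minimal sketch of the kind of metric bug described above. The names are hypothetical and this is not the actual TaskGroups code; it only shows how a batch in which nothing schedules can slip past the counter:

    import java.util.Set;
    import java.util.concurrent.atomic.AtomicLong;

    class PenaltyMetricSketch {
      private final AtomicLong schedulePenaltyCounter = new AtomicLong();

      interface BatchScheduler {
        // Returns the subset of taskIds that were actually scheduled.
        Set<String> schedule(Set<String> taskIds);
      }

      void evaluateBatch(BatchScheduler scheduler, Set<String> batch) {
        Set<String> scheduled = scheduler.schedule(batch);
        if (scheduled.isEmpty()) {
          // Bug: the early return skips the counter, so a batch in which
          // no task schedules (the strongest penalty signal) records 0.
          return;
        }
        // Only partially failed batches are ever counted.
        schedulePenaltyCounter.addAndGet(batch.size() - scheduled.size());
      }
    }

In a sketch like this, the metric reads near zero exactly when groups are most heavily penalized, which is consistent with how the penalty was falsely ruled out at first.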
- David

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58259/#review172889
-----------------------------------------------------------


On April 25, 2017, 3:03 a.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58259/
> -----------------------------------------------------------
> 
> (Updated April 25, 2017, 3:03 a.m.)
> 
> 
> Review request for Aurora, Santhosh Kumar Shanmugham, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> In the Dynamic Reservations review (and on the mailing list), I mentioned that we could implement update affinity with less complexity by using the same technique as preemption. Here is how that works.
> 
> This just adds a simple wrapper around the preemptor's BiCache structure and then optimistically tries to keep an agent free for a task during the update process. (A rough illustrative sketch of this wrapper follows at the end of this message.)
> 
> Note: I don't bother even checking the resources before reserving the agent. I figure there is a chance the agent has enough room, and if not, we'll catch it when we attempt to veto the offer. We always need to check the offer like this anyway, in case constraints change. In the worst case, it adds some delay in the rare case that you increase resources.
> 
> We also don't persist the reservations, so if the Scheduler fails over during an update, the worst case is that any instances in the in-flight batch that are between KILLED and ASSIGNED need to fall back to the current first-fit scheduling algorithm.
> 
> 
> Diffs
> -----
> 
> src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java f0b148cd158d61cd89cc51dca9f3fa4c6feb1b49 
> src/main/java/org/apache/aurora/scheduler/scheduling/TaskScheduler.java 203f62bacc47470545d095e4d25f7e0f25990ed9 
> src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java a177b301203143539b052524d14043ec8a85a46d 
> src/main/java/org/apache/aurora/scheduler/updater/InstanceAction.java b4cd01b3e03029157d5ca5d1d8e79f01296b57c2 
> src/main/java/org/apache/aurora/scheduler/updater/InstanceActionHandler.java f25dc0c6d9c05833b9938b023669c9c36a489f68 
> src/main/java/org/apache/aurora/scheduler/updater/InstanceUpdater.java c129896d8cd54abd2634e2a339c27921042b0162 
> src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java e14112479807b4477b82554caf84fe733f62cf58 
> src/main/java/org/apache/aurora/scheduler/updater/StateEvaluator.java c95943d242dc2f539778bdc9e071f342005e8de3 
> src/main/java/org/apache/aurora/scheduler/updater/UpdateAgentReserver.java PRE-CREATION 
> src/main/java/org/apache/aurora/scheduler/updater/UpdaterModule.java 13cbdadad606d9acaadc541320b22b0ae538cc5e 
> src/test/java/org/apache/aurora/scheduler/scheduling/TaskSchedulerImplTest.java fa1a81785802b82542030e1aae786fe9570d9827 
> src/test/java/org/apache/aurora/scheduler/state/TaskAssignerImplTest.java cf2d25ec2e407df7159e0021ddb44adf937e1777 
> src/test/java/org/apache/aurora/scheduler/updater/AddTaskTest.java b2c4c66850dd8f35e06a631809530faa3b776252 
> src/test/java/org/apache/aurora/scheduler/updater/InstanceUpdaterTest.java c78c7fbd7d600586136863c99ce3d7387895efee 
> src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 30b44f88a5b8477e917da21d92361aea1a39ceeb 
> src/test/java/org/apache/aurora/scheduler/updater/KillTaskTest.java 833fd62c870f96b96343ee5e0eed0d439536381f 
> src/test/java/org/apache/aurora/scheduler/updater/NullAgentReserverTest.java PRE-CREATION 
> src/test/java/org/apache/aurora/scheduler/updater/UpdateAgentReserverImplTest.java PRE-CREATION 
> 
> 
> Diff: https://reviews.apache.org/r/58259/diff/2/
> 
> 
> Testing
> -------
> 
> ./gradlew build
> ./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> 
> File Attachments
> ----------------
> 
> Cache utilization over time
>   https://reviews.apache.org/media/uploaded/files/2017/04/25/7b41bd2b-4151-482c-9de2-9dee67c34133__declining-cache-hits.png
> Offer rate from Mesos over time
>   https://reviews.apache.org/media/uploaded/files/2017/04/25/b107d964-ee7d-435a-a3d9-2b54f6eac3fa__consistent-offer-rate.png
> Async task workload (scaled) correlation with degraded cache utilization
>   https://reviews.apache.org/media/uploaded/files/2017/04/25/7eaf37ac-fbf3-40eb-b3f6-90e914a3936f__async-task-correlation.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
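As referenced in the description above, here is a rough sketch of the shape of the reserver wrapper. This is an illustration only, not the actual UpdateAgentReserver from this diff: the method names are hypothetical, and a plain concurrent map stands in for the preemptor's TTL-expiring BiCache to keep the sketch self-contained.

    import java.util.Map;
    import java.util.Optional;
    import java.util.concurrent.ConcurrentHashMap;

    final class UpdateAgentReserverSketch {
      // instance key -> agent id. The real BiCache expires entries after a
      // timeout, so a crashed or stalled update cannot pin an agent forever.
      private final Map<String, String> reservations = new ConcurrentHashMap<>();

      // Called when the updater kills the old task: optimistically pin its
      // agent for the replacement instance. No resource check happens here;
      // an undersized reservation is caught later, at offer-evaluation time.
      void reserve(String agentId, String instanceKey) {
        reservations.put(instanceKey, agentId);
      }

      // Called once the replacement task is assigned (or the update moves on).
      void release(String instanceKey) {
        reservations.remove(instanceKey);
      }

      // Consulted during offer evaluation: offers on a reserved agent are
      // vetoed for every task except the instance holding the reservation.
      boolean isReservedFor(String agentId, String instanceKey) {
        return agentId.equals(reservations.get(instanceKey));
      }

      boolean isAgentReserved(String agentId) {
        return reservations.containsValue(agentId);
      }

      Optional<String> getReservedAgent(String instanceKey) {
        return Optional.ofNullable(reservations.get(instanceKey));
      }
    }

Because reservations in a scheme like this live only in memory and are never persisted, a Scheduler failover simply loses them, which matches the fallback behavior described in the review: affected instances fall back to the first-fit scheduling path.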