Subject: Review Request 59030: AURORA-1869 Reducing storage write lock contention in TaskStatusHandlerImpl
From: Mehrdad Nurolahzade
To: David McLaughlin, Stephan Erb, Zameer Manji
Cc: Aurora, Mehrdad Nurolahzade
Date: Fri, 05 May 2017 21:36:12 -0000
Message-ID: <20170505213612.44875.39462@reviews-vm2.apache.org>

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59030/
-----------------------------------------------------------

Review request for Aurora, David McLaughlin, Stephan Erb, and Zameer Manji.


Bugs: AURORA-1869
    https://issues.apache.org/jira/browse/AURORA-1869


Repository: aurora


Description
-------

`TaskStatusHandlerImpl` acquires the `LogStorage` write lock to process every status update received from the Mesos master.
During implicit and explicit reconciliation, the lock is acquired once per task in the cluster (tens of thousands of times in ours). According to data extracted from one of our production clusters, over 99.9% of reconciliation status update events are in fact `NOOP` status updates. The storage write lock contention induced by these status updates can be eliminated by adopting the double-checked locking pattern (as was done in [AURORA-1820](https://issues.apache.org/jira/browse/AURORA-1820)).

The contention also explains why the combination of reconciliation status update processing and other expensive processes such as snapshots can be fatal for the scheduler. The lock is not fair, so it does not guarantee any particular access order. As a result, snapshot structures might need to sit on the heap for a few seconds before they can be written to `LogStorage` and garbage collected.


Diffs
-----

  src/main/java/org/apache/aurora/scheduler/TaskStatusHandlerImpl.java 1aacecf3c2597a3f91dbc7da4c99fd1e80970f04
  src/test/java/org/apache/aurora/scheduler/TaskStatusHandlerImplTest.java 56a6b0c9ae8da18e9a47428b8ed37a559cfd04e7
  src/test/java/org/apache/aurora/scheduler/storage/testing/StorageTestUtil.java 21d26b3930ea965487b2dec48a48a98677ba022b


Diff: https://reviews.apache.org/r/59030/diff/1/


Testing
-------

TBD under a test cluster


Thanks,

Mehrdad Nurolahzade
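
For reviewers unfamiliar with the pattern named in the Description, here is a minimal sketch of double-checked locking around a read-write lock. It is not the patch in the diff: `StatusUpdateSketch`, `handleStatusUpdate`, the in-memory map, and the acquisition counter are all hypothetical stand-ins, and a plain `ReentrantReadWriteLock` is assumed in place of the scheduler's storage layer.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for a task store guarded by a read-write lock.
// None of these names come from Aurora; they exist only for illustration.
class StatusUpdateSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final Map<String, String> taskStates = new HashMap<>();
  private int writeLockAcquisitions = 0;

  // Returns true if the update changed the stored state, false if it was a no-op.
  boolean handleStatusUpdate(String taskId, String newState) {
    // First check: runs under the shared read lock, so no-op updates
    // never compete for the exclusive write lock.
    lock.readLock().lock();
    try {
      if (newState.equals(taskStates.get(taskId))) {
        return false;
      }
    } finally {
      lock.readLock().unlock();
    }

    // Second check: the state may have changed between releasing the read
    // lock and acquiring the write lock, so it is verified again.
    lock.writeLock().lock();
    try {
      writeLockAcquisitions++;
      if (newState.equals(taskStates.get(taskId))) {
        return false;
      }
      taskStates.put(taskId, newState);
      return true;
    } finally {
      lock.writeLock().unlock();
    }
  }

  int writeLockAcquisitions() {
    lock.readLock().lock();
    try {
      return writeLockAcquisitions;
    } finally {
      lock.readLock().unlock();
    }
  }

  public static void main(String[] args) {
    StatusUpdateSketch store = new StatusUpdateSketch();
    store.handleStatusUpdate("task-1", "RUNNING");
    // Simulate a reconciliation pass that re-sends the same status repeatedly.
    for (int i = 0; i < 1000; i++) {
      store.handleStatusUpdate("task-1", "RUNNING");
    }
    // Prints "write lock acquisitions: 1"
    System.out.println("write lock acquisitions: " + store.writeLockAcquisitions());
  }
}
```

In this sketch the 1000 repeated updates are rejected under the read lock, so only the first, state-changing update takes the write lock. The repeated check under the write lock is what makes the early exit safe when another thread modifies the task in between.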