Date: Thu, 1 Dec 2016 10:13:58 +0000 (UTC)
From: "David McLaughlin (JIRA)"
To: issues@aurora.apache.org
Subject: [jira] [Commented] (AURORA-1840) Issue with Curator-backed discovery under heavy load

[ https://issues.apache.org/jira/browse/AURORA-1840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15711556#comment-15711556 ]

David McLaughlin commented on AURORA-1840:
------------------------------------------

I figured this out.
The old zk commons approach was a session-based SingletonService. The new Curator-based recipe is built on Curator's LeaderLatch, which is explicitly a connection-based leadership concept. From http://curator.apache.org/curator-recipes/leader-latch.html:

{quote}LeaderLatch instances add a *ConnectionStateListener* to watch for connection problems. If SUSPENDED or LOST is reported, the LeaderLatch that is the leader will report that it is no longer the leader (i.e. *there will not be a leader until the connection is re-established*). If a LOST connection is RECONNECTED, the LeaderLatch will delete its previous ZNode and create a new one.{quote}

This is a really terrible idea for a Scheduler that can take minutes to fail over. We'll need to revert the commits listed in the issue description until someone can come up with a better Curator-backed recipe.

> Issue with Curator-backed discovery under heavy load
> ----------------------------------------------------
>
>                 Key: AURORA-1840
>                 URL: https://issues.apache.org/jira/browse/AURORA-1840
>             Project: Aurora
>          Issue Type: Bug
>          Components: Scheduler
>            Reporter: David McLaughlin
>            Priority: Blocker
>
> We've been having some performance issues recently with our production clusters at Twitter. A side effect of these is occasional stop-the-world GC pauses of up to 15 seconds. This has been happening at our scale for quite some time, but previous versions of the Scheduler were resilient to it and no leadership change would occur.
> Since we moved to Curator, we are no longer resilient to these GC pauses. The Scheduler now fails over any time we see a GC pause, even though these pauses are within the session timeout.
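To make the distinction concrete, here is a toy Java model (not the real Curator or Aurora code; the class names and event enum are invented for illustration) contrasting the two leadership policies. LeaderLatch's connection-based policy relinquishes leadership as soon as the connection is SUSPENDED or LOST, while the old session-based policy only gives up leadership when ZooKeeper actually expires the session:

```java
// Simplified model of the two leadership policies. ConnectionEvent mirrors
// Curator's ConnectionState names, but nothing here calls the Curator API.
enum ConnectionEvent { CONNECTED, SUSPENDED, LOST, RECONNECTED, SESSION_EXPIRED }

class ConnectionBasedLeader {
    private boolean leader = true;
    void onEvent(ConnectionEvent e) {
        switch (e) {
            case SUSPENDED:
            case LOST:
                leader = false; // LeaderLatch relinquishes leadership immediately
                break;
            case RECONNECTED:
                leader = true;  // simplified: assumes it wins the re-election after
                break;          // recreating its ZNode
            default:
                break;
        }
    }
    boolean isLeader() { return leader; }
}

class SessionBasedLeader {
    private boolean leader = true;
    void onEvent(ConnectionEvent e) {
        // Leadership survives transient disconnects; only session expiry demotes.
        if (e == ConnectionEvent.SESSION_EXPIRED) leader = false;
    }
    boolean isLeader() { return leader; }
}

public class LeadershipModel {
    public static void main(String[] args) {
        ConnectionBasedLeader latch = new ConnectionBasedLeader();
        SessionBasedLeader session = new SessionBasedLeader();

        // A 15s GC pause within a 30s session timeout: the client connection is
        // reported SUSPENDED, but ZooKeeper never actually expires the session.
        latch.onEvent(ConnectionEvent.SUSPENDED);
        session.onEvent(ConnectionEvent.SUSPENDED);

        System.out.println("connection-based leader after GC pause: " + latch.isLeader());
        System.out.println("session-based leader after GC pause: " + session.isLeader());
    }
}
```

Running this prints false for the connection-based leader (a failover) and true for the session-based one (no failover), which is exactly the behavior change described above.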
> Here is an example pause in the scheduler logs, with the associated ZK session timeout that leads to a failover:
> {code}
> I1118 19:40:16.871801 51800 sched.cpp:1025] Scheduler::statusUpdate took 586236ns
> I1118 19:40:16.902 [TaskGroupBatchWorker, StateMachine$Builder:389] redacted-9f565b4-067e-422f-b641-c6000f9ae2c8 state machine transition PENDING -> ASSIGNED
> I1118 19:40:16.903 [TaskGroupBatchWorker, TaskStateMachine:474] Adding work command SAVE_STATE for redacted-0-49f565b4-067e-422f-b641-c6000f9ae2c8
> I1118 19:40:16.903 [TaskGroupBatchWorker, TaskAssigner$TaskAssignerImpl:130] Offer on agent redacted (id 566ae347-c1b6-44ce-8551-b7a6cda72989-S7579) is being assigned task redacted-0-49f565b4-067e-422f-b641-c6000f9ae2c8.
> W1118 19:40:31.744 [main-SendThread(redacted:2181), ClientCnxn$SendThread:1108] Client session timed out, have not heard from server in 20743ms for sessionid 0x6584fd2b34ede86
> {code}
> As you can see from the timestamps, there was a 15s GC pause (confirmed in our GC logs: a CMS promotion failure caused the pause), and this triggers a 20s session timeout. Note: we have seen GC pauses as short as 7s cause the same behavior. Removed: my ZK was rusty. 20s is 2/3 of our 30s ZK timeout, so our session timeout is being wired through fine.
> We have confirmed that the Scheduler no longer fails over when deploying from HEAD with these two commits reverted and zk_use_curator set to false:
> https://github.com/apache/aurora/commit/b417be38fe1fcae6b85f7e91cea961ab272adf3f
> https://github.com/apache/aurora/commit/69cba786efc2628eab566201dfea46836a1d9af5
> This is a pretty big blocker for us given how expensive Scheduler failovers are (currently several minutes for us).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)