Subject: Re: Review Request 54288: Make leader elections resilient to ZK disconnections.
From: Zameer Manji
To: Joshua Cohen, David McLaughlin, John Sirois, Stephan Erb
Cc: Aurora, Karthik Anantha Padmanabhan, Zameer Manji, Aurora ReviewBot
Date: Mon, 12 Dec 2016 00:37:58 -0000

> On Dec. 8, 2016, 3:42 p.m., Zameer Manji wrote:
> > Does anyone know how to get the test reports from jenkins or have an idea of what's going on?
>
> John Sirois wrote:
>     Yes, these are legit failures, no Jenkins logs needed, just this is enough:
>     ```
>     org.apache.aurora.scheduler.discovery.CuratorSingletonServiceTest > testAbdicateTransition FAILED
>         java.lang.AssertionError at CuratorSingletonServiceTest.java:125
>             java.lang.AssertionError
>
>     org.apache.aurora.scheduler.discovery.CuratorSingletonServiceTest > testLeadAdvertise FAILED
>         java.lang.AssertionError at CuratorSingletonServiceTest.java:94
>             java.lang.AssertionError
>     ```
>
>     I looked at the testLeadAdvertise one 1st which [blocks until an ephemeral node is added](https://github.com/apache/aurora/blob/master/src/test/java/org/apache/aurora/scheduler/discovery/CuratorSingletonServiceTest.java#L82-L83) to the leadership group path as the signal that leadership has been already synchronously obtained.
>     The assumption is broken with LeaderSelector since the transition is [fired on an executor runnable](https://github.com/apache/curator/blob/master/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderSelector.java#L239) and could be delayed arbitrarily past the test assertion.
>
>     In short, you're getting lucky having these 2 (and maybe more) pass on your machine. Needs careful re-review of the test infra and the new behavior of the LeaderSelector using the thread pool to do leadership transitions async.

Good catch. The tests already have an `awaitCapture` and since I am adding a timeout, we can use that to ensure the listeners are fired instead of plain asserts.


- Zameer


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54288/#review158601
-----------------------------------------------------------


On Dec. 8, 2016, 3:28 p.m., Zameer Manji wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54288/
> -----------------------------------------------------------
>
> (Updated Dec. 8, 2016, 3:28 p.m.)
>
>
> Review request for Aurora, David McLaughlin, Joshua Cohen, John Sirois, and Stephan Erb.
>
>
> Bugs: AURORA-1669
>     https://issues.apache.org/jira/browse/AURORA-1669
>
>
> Repository: aurora
>
>
> Description
> -------
>
> As documented in AURORA-1840 the Curator `LeaderLatch` recipe abdicates
> leadership if the ZK connection is lost or if there is a timeout. This is not
> compatible with the commons based implementation which would only abdicate
> leadership if the ZK session timeout occurred.
>
> This replaces the `LeaderLatch` recipe with the `LeaderSelector` recipe with a
> custom listener that only loses leadership if a connection loss occurs.
>
>
> Diffs
> -----
>
>   commons/src/main/java/org/apache/aurora/common/zookeeper/testing/ZooKeeperTestServer.java 50acaeba82e163f8f2970a264cbd889c9eb3b5ed
>   src/main/java/org/apache/aurora/scheduler/discovery/CuratorSingletonService.java c378172c850aafe0a9381552b5067277b40dbfab
>   src/test/java/org/apache/aurora/scheduler/discovery/BaseCuratorDiscoveryTest.java a2b4125369d1f6c0a79bc4ac0fb3d2dab8a6c583
>   src/test/java/org/apache/aurora/scheduler/discovery/CuratorSingletonServiceTest.java 6ea49b0c690d288ff59d1d4798144bfa2d153d3a
>
> Diff: https://reviews.apache.org/r/54288/diff/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Zameer Manji
>
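
For readers of the archive, the idea raised in this thread (wait for an asynchronously fired listener with a timeout instead of asserting right away) could be sketched roughly as below. This is only an illustration under stated assumptions: `AsyncElection`, its `Listener` interface, and `awaitLeading` are hypothetical stand-ins written for this sketch. They are not Aurora's `awaitCapture` helper and not Curator's `LeaderSelector`; only the JDK concurrency classes are real APIs.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a recipe that fires its leadership callback on an executor. */
final class AsyncElection {
  interface Listener {
    void onLeading();
  }

  private final ExecutorService executor = Executors.newSingleThreadExecutor();

  /** Returns immediately; the callback runs later on the executor thread. */
  void start(Listener listener, long delayMillis) {
    executor.execute(() -> {
      try {
        Thread.sleep(delayMillis); // simulate an arbitrary scheduling delay
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return; // election was closed before the callback could fire
      }
      listener.onLeading();
    });
  }

  void close() {
    executor.shutdownNow();
  }
}

final class AwaitLeadershipExample {
  /** Blocks until the listener fires or the timeout elapses; returns whether it fired. */
  static boolean awaitLeading(long callbackDelayMillis, long timeoutMillis) {
    CountDownLatch leading = new CountDownLatch(1);
    AsyncElection election = new AsyncElection();
    try {
      election.start(leading::countDown, callbackDelayMillis);
      // A plain assert placed here would race with the executor thread;
      // waiting on the latch makes the test independent of scheduling luck.
      return leading.await(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } finally {
      election.close();
    }
  }

  public static void main(String[] args) {
    // Callback is faster than the timeout.
    System.out.println(awaitLeading(50, 5_000));
    // Callback is slower than the timeout, so the wait gives up instead of hanging.
    System.out.println(awaitLeading(5_000, 50));
  }
}
```

The timeout matters as much as the wait: without it, a listener that never fires would hang the test run rather than fail it.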