Return-Path: X-Original-To: apmail-mesos-reviews-archive@minotaur.apache.org Delivered-To: apmail-mesos-reviews-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 0C4CF182DA for ; Fri, 16 Oct 2015 18:06:41 +0000 (UTC) Received: (qmail 37514 invoked by uid 500); 16 Oct 2015 18:06:41 -0000 Delivered-To: apmail-mesos-reviews-archive@mesos.apache.org Received: (qmail 37489 invoked by uid 500); 16 Oct 2015 18:06:40 -0000 Mailing-List: contact reviews-help@mesos.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: reviews@mesos.apache.org Delivered-To: mailing list reviews@mesos.apache.org Received: (qmail 37471 invoked by uid 99); 16 Oct 2015 18:06:40 -0000 Received: from reviews-vm.apache.org (HELO reviews.apache.org) (140.211.11.40) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 16 Oct 2015 18:06:40 +0000 Received: from reviews.apache.org (localhost [127.0.0.1]) by reviews.apache.org (Postfix) with ESMTP id B62D72A3CDC; Fri, 16 Oct 2015 18:06:38 +0000 (UTC) Content-Type: multipart/alternative; boundary="===============8938031120152115857==" MIME-Version: 1.0 Subject: Re: Review Request 39325: Fixed race between coordinator election and recovery in replicated log. From: "Neil Conway" To: "Joris Van Remoortere" , "Timothy Chen" , "Jie Yu" Cc: "Neil Conway" , "mesos" Date: Fri, 16 Oct 2015 18:06:38 -0000 Message-ID: <20151016180638.28636.71720@reviews.apache.org> X-ReviewBoard-URL: https://reviews.apache.org/ Auto-Submitted: auto-generated Sender: "Neil Conway" X-ReviewGroup: mesos X-Auto-Response-Suppress: DR, RN, OOF, AutoReply X-ReviewRequest-URL: https://reviews.apache.org/r/39325/ X-Sender: "Neil Conway" References: <20151015003146.1509.53555@reviews.apache.org> In-Reply-To: <20151015003146.1509.53555@reviews.apache.org> Reply-To: "Neil Conway" X-ReviewRequest-Repository: mesos --===============8938031120152115857== MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit > On Oct. 15, 2015, 12:31 a.m., Jie Yu wrote: > > src/messages/log.proto, line 148 > > > > > > I am just thinking about the rolling upgrade case. What happens if the old coordinator receives a response from a new replica. According to the current semantics, it'll be treated as a NACK. Is that OK? Will that unnecessarily demote the old coordinator? > > > > If you think that's OK, please add a comment about why it is OK (e.g., it'll eventually succeed when all replicas/coordinator are updated). If that's the case, we definitely need to call out this in upgrades.md. > > > > Ditto for the 'ignored' field in WriteResponse. As I see it, we have three options when the candidate coordinator is old (<= 0.25) but one or more replicas are new: 1. New replica sends "ignored = true, okay = false", which means that the coordinator will view this as a NACK and potentially get demoted. 2. New replica sends "ignored = true, okay = true": clearly this is wrong 3. New replica only sends an "ignored" response if the master is sufficiently new; otherwise, it keeps the old behavior of silently ignoring the promise/write request Obviously #2 is a non-starter; I think #1 should be fine, and is preferrable to the complexity of having replicas track the coordinator's version. - Neil ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/39325/#review102710 ----------------------------------------------------------- On Oct. 15, 2015, 1:21 a.m., Neil Conway wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/39325/ > ----------------------------------------------------------- > > (Updated Oct. 15, 2015, 1:21 a.m.) > > > Review request for mesos, Jie Yu, Joris Van Remoortere, and Timothy Chen. > > > Bugs: MESOS-3280 > https://issues.apache.org/jira/browse/MESOS-3280 > > > Repository: mesos > > > Description > ------- > > MESOS-3280. The basic problem is that replicas silently ignore inbound Promise > and Write requests if they have not finished the recovery protocol yet (because > they can't safely vote on such requests). Hence, if we try to do a Paxos round > while a quorum of nodes have not finished recovering, the Paxos round will never > complete. In particular, this might happen during coordinator election: > coordinator election (which is implemented as performing a full Paxos round) > starts as soon as the candidate coordinator replica has finished the recovery > protocol. If several nodes start concurrently, a quorum of those nodes might > still be executing the recovery protocol, and hence the coordinator will never > be elected. > > To address this, add "ignored" responses to the Promise and Write sub-protocols: > if a proposer sees a quorum of "ignored" responses to a promise or write request > it has issued, it knows the request will never succeed. When used for > coordinator election, the current coding will retry immediately (without a > backoff). > > Note that replicas will still silently drop promise/write requests if another > kind of problem occurs (e.g., an I/O error prevents reading/writing log > data). We might consider changing this, although it will require some thought: > e.g., if a replica's disk is broken, sending an "ignored" message on every > request might flood the network. > > CODE REVIEW TO DISCUSS / FIX: > > * Test mock is incredibly ugly: it works, but we clearly need a better approach > before committing this. I've been chatting with @tnachen to find a better > approach but haven't got anything that works yet. > > * Should we add a backoff when retrying after a failed coordinator election? > > * Should we also send back an "ignored" response if an I/O error occurs? > > > Diffs > ----- > > src/log/consensus.cpp 59f80d02d1d3c11683631f3fc5f6e923b5ebdf96 > src/log/coordinator.cpp 5500bca77f3020e0051010c5c178a20a3c7ad44a > src/log/replica.hpp 33d3f1d9e89035936c67739898e73a06b391fcd0 > src/log/replica.cpp 75d39ff56822bf00fce9daf5c1e3befb75f2e039 > src/messages/log.proto d73b33f865963292af580945659ad0e800f2a204 > src/state/log.cpp a75a605a4b0edb8863a3378e2133df7d6eb1cc3d > src/tests/log_tests.cpp f2dd47cfbe73fb18c360a637db009b7d391a782e > src/tests/slave_tests.cpp 10a4fa7eaa8e868ccc6d60ac220d66a4f0a523b4 > > Diff: https://reviews.apache.org/r/39325/diff/ > > > Testing > ------- > > "make check" passes, including a new test that uses a newly constructed mock to ensure we're testing the message schedule described above. > > I also wrote a script stops and starts mesos-master in a loop, removing the replicated log each time. Without the patch, this occasionally fails with a "registry fetch" timeout; with the patch, you can observe several scenarios where coordinator election is reborted and retried because a quorum of ignored responses is seen. Note that in some cases, we need to retry coordinator election up to ~70 times (!), because we don't currently use a backoff; that should probably be fixed, per comments above. But the important point is that election eventually succeeds and we don't hang. > > > Thanks, > > Neil Conway > > --===============8938031120152115857==--