Mailing-List: contact reviews-help@mesos.apache.org; run by ezmlm
Precedence: bulk
Reply-To: reviews@mesos.apache.org
Content-Type: multipart/alternative;
 boundary="===============8938031120152115857=="
MIME-Version: 1.0
Subject: Re: Review Request 39325: Fixed race between coordinator election and
 recovery in replicated log.
From: "Neil Conway" <neil.conway@gmail.com>
To: "Joris Van Remoortere" <joris.van.remoortere@gmail.com>,
 "Timothy Chen" <tnachen@apache.org>, "Jie Yu" <yujie.jay@gmail.com>
Cc: "Neil Conway" <neil.conway@gmail.com>, "mesos" <reviews@mesos.apache.org>
Date: Fri, 16 Oct 2015 18:06:38 -0000
Message-ID: <20151016180638.28636.71720@reviews.apache.org>
Auto-Submitted: auto-generated
Sender: "Neil Conway" <noreply@reviews.apache.org>
References: <20151015003146.1509.53555@reviews.apache.org>
In-Reply-To: <20151015003146.1509.53555@reviews.apache.org>
Reply-To: "Neil Conway" <neil.conway@gmail.com>

--===============8938031120152115857==
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit


> On Oct. 15, 2015, 12:31 a.m., Jie Yu wrote:
> > src/messages/log.proto, line 148
> > <https://reviews.apache.org/r/39325/diff/2/?file=1098525#file1098525line148>
> >
> >     I am just thinking about the rolling upgrade case. What happens if the old coordinator receives a response from a new replica. According to the current semantics, it'll be treated as a NACK. Is that OK? Will that unnecessarily demote the old coordinator?
> >     
> >     If you think that's OK, please add a comment about why it is OK (e.g., it'll eventually succeed when all replicas/coordinator are updated). If that's the case, we definitely need to call out this in upgrades.md.
> >     
> >     Ditto for the 'ignored' field in WriteResponse.

As I see it, we have three options when the candidate coordinator is old (<= 0.25) but one or more replicas are new:

1. New replica sends "ignored = true, okay = false", which means that the coordinator will view this as a NACK and potentially get demoted.
2. New replica sends "ignored = true, okay = true": clearly this is wrong
3. New replica only sends an "ignored" response if the master is sufficiently new; otherwise, it keeps the old behavior of silently ignoring the promise/write request

Obviously #2 is a non-starter; I think #1 should be fine, and is preferrable to the complexity of having replicas track the coordinator's version.


- Neil


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/39325/#review102710
-----------------------------------------------------------


On Oct. 15, 2015, 1:21 a.m., Neil Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/39325/
> -----------------------------------------------------------
> 
> (Updated Oct. 15, 2015, 1:21 a.m.)
> 
> 
> Review request for mesos, Jie Yu, Joris Van Remoortere, and Timothy Chen.
> 
> 
> Bugs: MESOS-3280
>     https://issues.apache.org/jira/browse/MESOS-3280
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> MESOS-3280. The basic problem is that replicas silently ignore inbound Promise
> and Write requests if they have not finished the recovery protocol yet (because
> they can't safely vote on such requests). Hence, if we try to do a Paxos round
> while a quorum of nodes have not finished recovering, the Paxos round will never
> complete. In particular, this might happen during coordinator election:
> coordinator election (which is implemented as performing a full Paxos round)
> starts as soon as the candidate coordinator replica has finished the recovery
> protocol. If several nodes start concurrently, a quorum of those nodes might
> still be executing the recovery protocol, and hence the coordinator will never
> be elected.
> 
> To address this, add "ignored" responses to the Promise and Write sub-protocols:
> if a proposer sees a quorum of "ignored" responses to a promise or write request
> it has issued, it knows the request will never succeed.  When used for
> coordinator election, the current coding will retry immediately (without a
> backoff).
> 
> Note that replicas will still silently drop promise/write requests if another
> kind of problem occurs (e.g., an I/O error prevents reading/writing log
> data). We might consider changing this, although it will require some thought:
> e.g., if a replica's disk is broken, sending an "ignored" message on every
> request might flood the network.
> 
> CODE REVIEW TO DISCUSS / FIX:
> 
> * Test mock is incredibly ugly: it works, but we clearly need a better approach
>   before committing this. I've been chatting with @tnachen to find a better
>   approach but haven't got anything that works yet.
> 
> * Should we add a backoff when retrying after a failed coordinator election?
> 
> * Should we also send back an "ignored" response if an I/O error occurs?
> 
> 
> Diffs
> -----
> 
>   src/log/consensus.cpp 59f80d02d1d3c11683631f3fc5f6e923b5ebdf96 
>   src/log/coordinator.cpp 5500bca77f3020e0051010c5c178a20a3c7ad44a 
>   src/log/replica.hpp 33d3f1d9e89035936c67739898e73a06b391fcd0 
>   src/log/replica.cpp 75d39ff56822bf00fce9daf5c1e3befb75f2e039 
>   src/messages/log.proto d73b33f865963292af580945659ad0e800f2a204 
>   src/state/log.cpp a75a605a4b0edb8863a3378e2133df7d6eb1cc3d 
>   src/tests/log_tests.cpp f2dd47cfbe73fb18c360a637db009b7d391a782e 
>   src/tests/slave_tests.cpp 10a4fa7eaa8e868ccc6d60ac220d66a4f0a523b4 
> 
> Diff: https://reviews.apache.org/r/39325/diff/
> 
> 
> Testing
> -------
> 
> "make check" passes, including a new test that uses a newly constructed mock to ensure we're testing the message schedule described above.
> 
> I also wrote a script stops and starts mesos-master in a loop, removing the replicated log each time. Without the patch, this occasionally fails with a "registry fetch" timeout; with the patch, you can observe several scenarios where coordinator election is reborted and retried because a quorum of ignored responses is seen. Note that in some cases, we need to retry coordinator election up to ~70 times (!), because we don't currently use a backoff; that should probably be fixed, per comments above. But the important point is that election eventually succeeds and we don't hang.
> 
> 
> Thanks,
> 
> Neil Conway
> 
>


--===============8938031120152115857==--