zookeeper-dev mailing list archives

From "Michael K. Edwards" <m.k.edwa...@gmail.com>
Subject Re: ReconfigInProgress error
Date Sun, 25 Nov 2018 06:11:43 GMT
I don't often admit defeat, but I can't make heads or tails of the
error handling (or lack thereof) in the reconfiguration code paths.
If anybody wants to take a stab at explaining which parts of the
processAck -> tryToCommit -> processReconfig -> reconfigure call chain
should and shouldn't go through if the bind() call fails, maybe I can
try to write tests that verify that and modify the code under test to
behave accordingly.  I've filed ZOOKEEPER-3198 as an umbrella for this
work, and pushed what I've got to
https://github.com/mkedwards/zookeeper/tree/broken-bind-3.5, in case
somebody wants to try to take it forward from there.

In the meantime, I'm running tests in parallel inside a Docker
container (with a code state that has patches applied for all three
3.5 blocker/critical Jiras).  Nothing seems "flaky" yet.  We'll deploy
this in our QA environment next week, and throw some load at it, and
see what happens.  (And run the test suite a few hundred times, too.)

Alex (or anyone else), do you consider any of the other outstanding
Jiras to be obstacles to exercising the reconfiguration features in
3.5.x on a production cluster?  How serious is
https://issues.apache.org/jira/browse/ZOOKEEPER-2202 ?  Is it related
to https://issues.apache.org/jira/browse/ZOOKEEPER-2836 ?  And how
serious is https://issues.apache.org/jira/browse/ZOOKEEPER-1896 ?
Does mixing 3.4.x and 3.5.x in the same cluster work?  Is it best to
disable reconfig while migrating cluster members from 3.4.x to 3.5.x,
and then enable reconfig and do a rolling restart?
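
As an aside, the only client-side recourse for a ReconfigInProgress
error seems to be backing off and retrying, since the server admits one
reconfig op at a time. Here's a minimal sketch of that retry pattern in
Python; the reconfig call itself is a stand-in stub, not the real
ZooKeeper client API:

```python
import time

class ReconfigInProgress(Exception):
    """Stand-in for KeeperErrorCode = ReconfigInProgress (hypothetical stub)."""

def retry_reconfig(do_reconfig, max_attempts=5, backoff_s=0.01):
    """Retry a reconfig call while the cluster reports one already in flight."""
    for attempt in range(1, max_attempts + 1):
        try:
            do_reconfig()
            return attempt                     # succeeded on this attempt
        except ReconfigInProgress:
            if attempt == max_attempts:
                raise                          # give up after the last attempt
            time.sleep(backoff_s * attempt)    # linear backoff before retrying

# Simulate a reconfig that is rejected twice, then admitted.
calls = {"n": 0}
def fake_reconfig():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ReconfigInProgress()

print(retry_reconfig(fake_reconfig))           # -> 3
```

A real client would also want to cap the total wait time and
distinguish ReconfigInProgress from other KeeperException codes rather
than retrying blindly.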
On Sat, Nov 24, 2018 at 12:13 PM Alexander Shraer <shralex@gmail.com> wrote:
> Hi Michael,
> In general, one reconfig op is allowed at a time, and this error indicates that one is
> already in progress. If there are enough peers to form a quorum, a failure to connect to one
> of them shouldn't be a problem. If there are not enough, the leader is supposed to give up
> leadership. This is true in general, unrelated to reconfig. A new leader will be elected and
> complete any reconfig in progress. That's the theory, at least; there may be a bug in the
> case you found.
> Some general flow is described in Sec 3.2 of our paper, https://www.usenix.org/system/files/conference/atc12/atc12-final74.pdf
> There are also the wiki docs, but they don't talk about recovery much. https://zookeeper.apache.org/doc/r3.5.3-beta/zookeeperReconfig.html
> Btw
> > robustness against
> > Byzantine faults that one is led to expect from Zookeeper?
> ZK is not designed to handle Byzantine faults in general. That's not to say that there
> is no bug in the case you found.
> Thanks,
> Alex
> On Sat, Nov 24, 2018 at 11:32 AM Michael K. Edwards <m.k.edwards@gmail.com> wrote:
>> I've been experimenting a bit with trying to propagate failures to
>> bind() server ports in tests up to where we can do something about it.
>> There's at least one category of test cases (callers of
>> ReconfigTest.testPortChangeToBlockedPort) where the server is supposed
>> to ride through a bind() failure, recovering on a subsequent
>> reconfiguration.  In my current code state, I'm encountering errors
>> like this:
>> 2018-11-24 11:04:46,252 [myid:] - INFO  [ProcessThread(sid:3
>> cport:-1)::PrepRequestProcessor@878] - Got user-level KeeperException
>> when processing sessionid:0x1002b98aa830000 type:reconfig cxid:0x1e
>> zxid:0x10000002b txntype:-1 reqpath:n/a Error Path:null
>> Error:KeeperErrorCode = ReconfigInProgress
>> I can hack things until this particular test passes, but it raises
>> questions about reconfiguration in general.  How exactly is the
>> cluster supposed to get out of this state?  If a cluster member drops
>> out of contact with the quorum while there is a reconfiguration in
>> flight, is there any recovery path that restores the ability to
>> process a reconfigure operation?  Is there a design doc for
>> reconfiguration that demonstrates the kind of robustness against
>> Byzantine faults that one is led to expect from Zookeeper?
