zookeeper-dev mailing list archives

From Alexander Shraer <shra...@gmail.com>
Subject Re: ReconfigInProgress error
Date Sat, 24 Nov 2018 20:13:16 GMT
Hi Michael,

In general, only one reconfig op is allowed at a time, and this error
indicates that one is already in progress. If there are enough peers to form
a quorum, a failure to connect to one of them shouldn’t be a problem. If
there are not enough, the leader is supposed to give up leadership. This is
true in general, unrelated to reconfig. A new leader will be elected and
complete any reconfig in progress. That’s the theory at least; there may be
a bug in the case you found.
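
Since the error only means that another reconfig is still pending, one way a caller could react is to wait and try again. The following is a minimal sketch of that idea, not code from ZooKeeper or from the test in question. It makes no ZooKeeper calls: `ReconfigInProgressException` is a hypothetical stand-in for the ReconfigInProgress error shown in the log, and the method name, attempt limit, and delays are assumptions chosen for illustration.

```java
import java.util.concurrent.Callable;

public class ReconfigRetry {
    /** Hypothetical stand-in for the ReconfigInProgress error; not a ZooKeeper class. */
    static class ReconfigInProgressException extends Exception {}

    /**
     * Runs op, retrying with a doubling delay while it reports that another
     * reconfig is in progress. Any other exception is propagated immediately.
     */
    static <T> T retryWhileReconfigInProgress(Callable<T> op, int maxAttempts,
                                              long initialDelayMs) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delayMs = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            } catch (ReconfigInProgressException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up: the pending reconfig never cleared
                }
                Thread.sleep(delayMs);
                delayMs *= 2; // back off before the next attempt
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated operation: rejected twice, accepted on the third attempt.
        String result = retryWhileReconfigInProgress(() -> {
            if (++calls[0] < 3) {
                throw new ReconfigInProgressException();
            }
            return "reconfig accepted";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

A bounded retry like this only helps if the pending reconfig eventually completes, for example after a new leader is elected. If the error persists past the attempt limit, that points to the kind of stuck state the original question describes.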

The general flow is described in Sec. 3.2 of our paper:
https://www.usenix.org/system/files/conference/atc12/atc12-final74.pdf

There are also the wiki docs, but they don’t say much about recovery:
https://zookeeper.apache.org/doc/r3.5.3-beta/zookeeperReconfig.html

Btw

> robustness against
> Byzantine faults that one is led to expect from Zookeeper?

ZK is not designed to handle Byzantine faults in general. That’s not to say
that there is no bug in the case you found.

Thanks,
Alex

On Sat, Nov 24, 2018 at 11:32 AM Michael K. Edwards <m.k.edwards@gmail.com>
wrote:

> I've been experimenting a bit with trying to propagate failures to
> bind() server ports in tests up to where we can do something about it.
> There's at least one category of test cases (callers of
> ReconfigTest.testPortChangeToBlockedPort) where the server is supposed
> to ride through a bind() failure, recovering on a subsequent
> reconfiguration.  In my current code state, I'm encountering errors
> like this:
>
> 2018-11-24 11:04:46,252 [myid:] - INFO  [ProcessThread(sid:3
> cport:-1)::PrepRequestProcessor@878] - Got user-level KeeperException
> when processing sessionid:0x1002b98aa830000 type:reconfig cxid:0x1e
> zxid:0x10000002b txntype:-1 reqpath:n/a Error Path:null
> Error:KeeperErrorCode = ReconfigInProgress
>
> I can hack things until this particular test passes, but it raises
> questions about reconfiguration in general.  How exactly is the
> cluster supposed to get out of this state?  If a cluster member drops
> out of contact with the quorum while there is a reconfiguration in
> flight, is there any recovery path that restores the ability to
> process a reconfigure operation?  Is there a design doc for
> reconfiguration that demonstrates the kind of robustness against
> Byzantine faults that one is led to expect from Zookeeper?
>
