zookeeper-dev mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: KIP-500: Replace ZooKeeper with a Self-Managed Metadata Quorum
Date Wed, 07 Aug 2019 19:25:06 GMT
On Wed, Aug 7, 2019 at 11:31 AM Karolos Antoniadis <karolos@gmail.com> wrote:

> In the paragraph that Michael mentioned, among others it is written: "For
> example, when a partition leader changes its ISR in ZK, the controller will
> typically not learn about these changes for many seconds." Why would it
> take "many seconds"?

I think that this is conflating a partition from (or within) the ZK
cluster with a simple hand-off.

In the case of a partition leader crashing, it will be several seconds
before the rest of the world hears about the event.

> Sending a watch event to the controller should be
> pretty fast.

Absolutely. If the ZK cluster has its act together. And if the cause of the
watch is detected quickly. And if you don't have a watch storm happening
due to huge numbers of listeners.

But none of those problems are really helped by moving the consensus
algorithms into a library.

> Also, in the same paragraph, Colin states "By the time the controller
> re-reads the znode and sets up a new watch, the state may have changed from
> what it was when the watch originally fired.  [...] only way to resolve the
> discrepancy." Why would this lead to any discrepancy? It seems to me that
> the controller, will read an even newer state in such a scenario.

You are correct and this has always been one of the selling points of ZK.
The way that you can reset the watch as part of the read operation means
that you can guarantee never to lose anything and if you are slow to
respond, you always get data that is as up-to-date as possible. Load
shedding tricks like that are really helpful. Getting notifications of
every change is actually disastrous in many cases, partly because of the
number of notifications and partly because the notifications can become
very heavy-weight with the data they have to carry.
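
As a rough illustration of the point above, here is a toy in-memory model in
Python of a read that re-arms a one-shot watch in the same step. It is only a
sketch: ToyZnode, SlowController and demo are hypothetical names invented for
this example, and none of it is the real ZooKeeper client API. The assumption
being modelled is that a watch fires once and is re-registered by the next
read, so a slow reader gets a single notification and then the latest state.

```python
class ToyZnode:
    """In-memory stand-in for a znode with one-shot watches (not the ZooKeeper API)."""

    def __init__(self, data):
        self.data = data
        self.version = 0
        self._watchers = []

    def get(self, watcher=None):
        """Return (data, version) and, in the same step, optionally register a one-shot watch."""
        if watcher is not None:
            self._watchers.append(watcher)
        return self.data, self.version

    def set(self, data):
        """Update the data, then fire and clear every registered watch."""
        self.data = data
        self.version += 1
        fired, self._watchers = self._watchers, []
        for notify in fired:
            notify()


class SlowController:
    """Hypothetical watcher that reacts to notifications only when catch_up() is called."""

    def __init__(self, znode):
        self.znode = znode
        self.pending = False
        self.notifications = 0
        self.seen = []  # (version, data) pairs actually read
        self._read_and_watch()

    def _on_change(self):
        # Only records that something changed; the notification carries no data.
        self.notifications += 1
        self.pending = True

    def _read_and_watch(self):
        # Reading and re-arming the watch happen together, so no later change is missed.
        data, version = self.znode.get(watcher=self._on_change)
        self.seen.append((version, data))

    def catch_up(self):
        if self.pending:
            self.pending = False
            self._read_and_watch()


def demo():
    znode = ToyZnode("isr=[1,2,3]")
    controller = SlowController(znode)
    # Three changes arrive before the controller gets around to reacting.
    znode.set("isr=[1,2]")
    znode.set("isr=[1]")
    znode.set("isr=[1,3]")
    controller.catch_up()
    return controller.notifications, controller.seen


if __name__ == "__main__":
    print(demo())
```

In this model the three writes produce one notification, and the catch-up
read returns the newest version while skipping the intermediate ones. That
is the load-shedding behaviour described above, under the stated assumptions.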

Putting this into a library doesn't help at all, of course.

> Also, another argument mentioned in original KIP-500 proposal had to do
> with speeding up the failover of a controller: "Because the controllers
> will now all track the latest state, controller failover will not require a
> lengthy reloading period where we transfer all the state to the new
> controller." But this does not seem to be a problem with ZK per se and
> could be solved by keeping a broker as a standby controller (briefly
> mentioned here
> https://www.slideshare.net/ConfluentInc/a-deep-dive-into-kafka-controller
> as future work.)

Also, the state still has to move. Using an in-process library doesn't
change that at all. It could move via ZK or it could move as part of quorum
decisions or via some sort of follow-the-leader protocol. But it has to
move. Whoever is leader has to write it out to the network and whoever is
follower has to read it in. Whether the data is written/read directly or
via ZK isn't really a big deal.

