zookeeper-dev mailing list archives

From Michael Han <h...@apache.org>
Subject Re: KIP-500: Replace ZooKeeper with a Self-Managed Metadata Quorum
Date Thu, 08 Aug 2019 00:55:34 GMT
Related discussions:
https://lists.apache.org/list.html?dev@kafka.apache.org:lte=1M:KIP-500

>> Why would this lead to any discrepancy? It seems to me that the
controller will read an even newer state in such a scenario.

I think in this case what the Kafka side expects is not just to be able to
get the latest state, but also not to miss any of the state changes. See the
Kafka thread ^ which says: *"Treating metadata as a log avoids a lot of the
complex failure corner cases we have seen where a broker misses a single
update sent from the controller, but gets subsequent updates."*

We don't support this in ZK today - clients will always be able to read the
latest state, but a client might miss watcher events in the window between
the first watch firing and the watch being reset on the client side (any
changes that happen in that window are never notified). The subscribe API can
solve this.
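To make that window concrete, here is a minimal sketch against the standard
org.apache.zookeeper Java client of the usual read-and-rearm pattern (the
znode path and class name are only illustrative, and error handling is
elided). Any writes that land between the watch firing and the re-read are
coalesced into the single latest state the client observes:

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class StateWatcher implements Watcher {
        // Illustrative path only -- not the actual znode layout under discussion.
        private static final String PATH = "/brokers/topics/foo/partitions/0/state";
        private final ZooKeeper zk;

        public StateWatcher(ZooKeeper zk) { this.zk = zk; }

        public void readAndRearm() throws KeeperException, InterruptedException {
            Stat stat = new Stat();
            // getData() re-registers the (one-shot) watch as part of the read,
            // so the data returned is at least as new as whatever fired the watch.
            byte[] data = zk.getData(PATH, this, stat);
            // Process 'data' here. Any writes that happened between the previous
            // watch firing and this read are visible only as this single latest
            // version (stat.getVersion()); intermediate versions are never delivered.
        }

        @Override
        public void process(WatchedEvent event) {
            if (event.getType() == Event.EventType.NodeDataChanged) {
                try {
                    readAndRearm();
                } catch (KeeperException | InterruptedException e) {
                    // Real code would handle retries / session expiry here.
                }
            }
        }
    }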

One thing that the proposal only briefly mentions, but which I think is a key
differentiator, is scalability: managing metadata as event logs will scale
better, because ZK is inherently limited by memory. One can shard a cluster to
work around the limit, but that creates other problems (consistency,
operations, etc.). An on-disk storage backend for ZK's data tree might solve
this.

On Wed, Aug 7, 2019 at 12:25 PM Ted Dunning <ted.dunning@gmail.com> wrote:

> On Wed, Aug 7, 2019 at 11:31 AM Karolos Antoniadis <karolos@gmail.com>
> wrote:
>
> > In the paragraph that Michael mentioned, among other things it is written:
> > "For example, when a partition leader changes its ISR in ZK, the controller
> > will typically not learn about these changes for many seconds." Why would
> > it take "many seconds"?
>
>
> I think that this is conflating the situation of a partition from (or of)
> the ZK cluster with simple hand-offs.
>
> In the case of a partition leader crashing, it will be several seconds
> before the rest of the world hears about the event.
>
>
> > Sending a watch event to the controller should be
> > pretty fast.
> >
>
> Absolutely. If the ZK cluster has its act together. And if the cause of the
> watch is detected quickly. And if you don't have a watch storm happening
> due to huge numbers of listeners.
>
> But none of those problems are really helped by moving the consensus
> algorithms into a library.
>
> > Also, in the same paragraph, Colin states "By the time the controller
> > re-reads the znode and sets up a new watch, the state may have changed from
> > what it was when the watch originally fired.  [...] only way to resolve the
> > discrepancy." Why would this lead to any discrepancy? It seems to me that
> > the controller will read an even newer state in such a scenario.
> >
>
> You are correct and this has always been one of the selling points of ZK.
> The way that you can reset the watch as part of the read operation means
> that you can guarantee never to lose anything and if you are slow to
> respond, you always get data that is as up-to-date as possible. Load
> shedding tricks like that are really helpful. Getting notifications of
> every change is actually disastrous in many cases, partly because of the
> number of notifications and partly because the notifications can become
> very heavy-weight with the data they have to carry.
>
> Putting this into a library doesn't help at all, of course.
>
>
>
> >
> > Also, another argument mentioned in the original KIP-500 proposal had to do
> > with speeding up the failover of a controller: "Because the controllers
> > will now all track the latest state, controller failover will not require a
> > lengthy reloading period where we transfer all the state to the new
> > controller." But this does not seem to be a problem with ZK per se and
> > could be solved by keeping a broker as a standby controller (briefly
> > mentioned here
> > https://www.slideshare.net/ConfluentInc/a-deep-dive-into-kafka-controller
> > as future work.)
> >
>
> Also, the state still has to move. Using an in-process library doesn't
> change that at all. It could move via ZK or it could move as part of quorum
> decisions or via some sort of follow-the-leader protocol. But it has to
> move. Whoever is leader has to write it out to the network and whoever is
> follower has to read it in. Whether the data is written/read directly or
> via ZK isn't really a big deal.
>
>
> >
>
