zookeeper-dev mailing list archives

From Michael Han <h...@apache.org>
Subject Re: KIP-500: Replace ZooKeeper with a Self-Managed Metadata Quorum
Date Fri, 02 Aug 2019 23:01:07 GMT
Very well said, thank you Ted!

>> I would still opt for quorum outside rather than quorum as a library.

One observation on outside quorum vs. library: for Raft, CockroachDB and
TiDB both chose the library approach instead of depending on etcd, though
they both share etcd's Raft implementation. ZooKeeper could be used in a
similar way if we can abstract ZAB and provide a nice SMR (state machine
replication) interface on top of it.
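
To make "SMR interface" a bit more concrete, here is a rough sketch of the
shape such a library could take. Every name in it (StateMachine, LocalLog,
KeyValueStore) is hypothetical and none of it is ZooKeeper or etcd code.
LocalLog is a single-process stand-in for the consensus layer, so it shows
only the contract between the log and the application, not ZAB or Raft:

```python
from abc import ABC, abstractmethod


class StateMachine(ABC):
    """Application logic the host system would supply (hypothetical)."""

    @abstractmethod
    def apply(self, index, command):
        """Apply one committed command; must be deterministic."""


class LocalLog:
    """Single-process stand-in for the consensus layer (assumption).

    A real library would replicate each command to a quorum before
    committing it; here every proposal commits immediately.
    """

    def __init__(self, machine):
        self._machine = machine
        self._entries = []

    def propose(self, command):
        # Append to the log, then hand the committed entry to the
        # state machine in log order (indices start at 1).
        self._entries.append(command)
        index = len(self._entries)
        self._machine.apply(index, command)
        return index


class KeyValueStore(StateMachine):
    """Example state machine: commands are b"key=value" pairs."""

    def __init__(self):
        self.data = {}
        self.last_applied = 0

    def apply(self, index, command):
        key, _, value = command.decode().partition("=")
        self.data[key] = value
        self.last_applied = index


store = KeyValueStore()
log = LocalLog(store)
log.propose(b"leader=broker-1")
log.propose(b"leader=broker-2")
```

The point of the split is that the host system only implements apply(),
while ordering and replication stay inside the library.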

On Fri, Aug 2, 2019 at 12:44 PM Ted Dunning <ted.dunning@gmail.com> wrote:

> The core issue in these situations in my experience is that having the
> quorum as a separate service can be a pain point. This misunderstanding
> about how watches work and why they don't provide the data is just a
> symptom of this. Having an integrated quorum is very attractive from the
> point of view of management and tighter integration with the record of
> state.
>
> If I had it all to do over again, though, I think I would still opt for
> quorum outside rather than quorum as a library. There are management
> burdens, but many of those management burdens are implicit in the fact that
> managing the state of the system is different from managing the system or
> doing the stuff the system does. Pulling the quorum system into the
> do-stuff system doesn't actually make life all that much easier even if it
> does simplify the installer.
>
> The countervailing risk that you are likely to get a quorum system wrong is
> really significant. Having a battle-tested (some might say battle-scarred)
> system like ZK is quite a virtue since you can have a different level of
> confidence in it than something you whipped up last week.
>
>
>
> On Fri, Aug 2, 2019 at 11:49 AM Patrick Hunt <phunt@apache.org> wrote:
>
> > Michael I think you are describing subscribe - this?
> > https://issues.apache.org/jira/browse/ZOOKEEPER-153
> > wasn't there some work done to keep tlogs around for a while? Or am I
> > misremembering? (fb folks?)
> >
> > I'll also add that we haven't done any benchmarking in quite some time. It
> > would be interesting to collect a few of these use cases from the
> > community, esp downstreams, and evaluate performance, see if we can
> > address.
> >
> > Patrick
> >
> > On Fri, Aug 2, 2019 at 11:03 AM Michael Han <hanm@apache.org> wrote:
> >
> > > Folks,
> > >
> > > Some of you might have already seen this. Comments?
> > >
> > >
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-500%3A+Replace+ZooKeeper+with+a+Self-Managed+Metadata+Quorum
> > >
> > >
> > > What caught my eye is:
> > >
> > > *Worse still, although ZooKeeper is the store of record, the state in
> > > ZooKeeper often doesn't match the state that is held in memory in the
> > > controller.  For example, when a partition leader changes its ISR in ZK,
> > > the controller will typically not learn about these changes for many
> > > seconds.  There is no generic way for the controller to follow the
> > > ZooKeeper event log.  Although the controller can set one-shot watches,
> > > the number of watches is limited for performance reasons.  When a watch
> > > triggers, it doesn't tell the controller the current state-- only that
> > > the state has changed.  By the time the controller re-reads the znode and
> > > sets up a new watch, the state may have changed from what it was when the
> > > watch originally fired.  If there is no watch set, the controller may not
> > > learn about the change at all.  In some cases, restarting the controller
> > > is the only way to resolve the discrepancy.*
> > >
> > > I've seen some similar ZooKeeper use cases that ended up like what's
> > > described here. How can ZooKeeper solve this? It seems to me that the
> > > only solution is to provide linearizable reads on watched operations.
> > > Thoughts?
> > >
> > > Michael.
> > >
> >
>
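
For anyone who wants to see the watch behaviour from the KIP-500 passage
quoted above in miniature, here is a toy in-memory model. ToyZnode is a
hypothetical class and is not the ZooKeeper client API; it only mimics the
two properties the passage describes, namely that a watch fires once and
that the notification carries no data:

```python
class ToyZnode:
    """Toy model of one-shot watch semantics (not the real client API)."""

    def __init__(self, value):
        self.value = value
        self._watchers = []

    def get(self, watch=None):
        # Reading with a watch registers it for the next change only.
        if watch is not None:
            self._watchers.append(watch)
        return self.value

    def set(self, value):
        self.value = value
        # One-shot: clear the registered watches before firing them.
        fired, self._watchers = self._watchers, []
        for callback in fired:
            callback()  # the notification says "changed", nothing more


znode = ToyZnode("isr=[1,2,3]")
pending = []   # notifications the controller has not processed yet
observed = []  # every state the controller actually read


def on_change():
    pending.append("changed")


observed.append(znode.get(watch=on_change))

# Two writes land before the controller handles the first notification.
znode.set("isr=[1,2]")
znode.set("isr=[1]")

# The controller catches up: re-read the znode and re-register the watch.
while pending:
    pending.pop()
    observed.append(znode.get(watch=on_change))
```

In this run the controller reads the initial state and the final state,
and never sees the intermediate "isr=[1,2]" write, because no watch was
registered when the second write happened.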
