kafka-dev mailing list archives

From "Colin McCabe" <cmcc...@apache.org>
Subject Re: [DISCUSS] KIP-500: Replace ZooKeeper with a Self-Managed Metadata Quorum
Date Tue, 03 Sep 2019 17:04:47 GMT
On Mon, Sep 2, 2019, at 07:51, Ron Dagostino wrote:
> Hi Colin.  It is not unusual for customers to wait before upgrading — 
> skipping so-called “point-zero” releases — to avoid as many of the 
> inevitable bugs that ride along with new functionality as possible.  
> Removal of Zookeeper is going to feel especially risky to these customers, 
> and arguably it is going to feel risky even to customers who might 
> otherwise be less sensitive to upgrade risk.
> 
> This leads me to believe it is reasonable to expect that the uptake of 
> the new ZK-less consensus quorum could be delayed in many installations 
> — that such customers might wait longer than usual to adopt the feature 
> and abandon their Zookeeper servers.
> 
> Will it be possible to use releases beyond the bridge release and not 
> abandon Zookeeper?  For example, what would happen if post-bridge the 
> new consensus quorum servers are never started?  Would Kafka still work 
> fine?  At what point MUST Zookeeper be abandoned?  Taking the 
> perspective of the above customers, I think they would prefer to have 
> others adopt the new ZK-less consensus quorum for several months and 
> encounter many of the inevitable bugs before adopting it themselves.  
> But at the same time they will not want to be stuck on the bridge 
> release that whole time because there are going to be both bug fixes 
> and new features that they will want to take advantage of.
> 
> If the bridge release is the last one that supports Zookeeper, and if 
> some customers stay on that release for a while, then I could see those 
> customers wanting back-ports of bug fixes and features to occur for a 
> period of time that extends beyond what is normally done.
> 
> Basically, to sum all of the above up, I think there is a reasonable 
> probability that having only a single bridge release could become a 
> barrier that causes angst for the project and the community.
> 
> I wonder if it would be in the interest of the project and the 
> community to mitigate the risk of there being a bridge release barrier 
> by extending the time when ZK would still be supported — perhaps for up 
> to a year — and the new consensus quorum could remain optional.
> 
> Ron
> 

Hi Ron,

Changing things always involves risk.  This is why we are trying to do as much as we can incrementally: for example, removing ZK dependencies from the tools and from the brokers.  However, there are things that can't really be done incrementally, and one of these is switching over to a new metadata store.

It might seem like supporting multiple options for where to store metadata would be safer somehow, but actually this is not the case.  Having to support totally different code paths involves a lot more work and a lot more testing.  We already have configurations that aren't tested enough.  Doubling (at least) the number of configurations we have to test is a non-starter.

This also ties in with the discussion in the KIP about why we don't plan on supporting pluggable consensus or pluggable metadata storage.  Doing this would force us to use only the lowest-common-denominator features of every metadata store.  We would not be able to take advantage of metadata as a stream of events, or of any of the features that ZK doesn't have.  Configuration would also be quite complex.

As the KIP states, the upgrade from a bridge release (there may be several bridge releases) to a ZK-less release will have no impact on clients.  It also won't have any impact on cluster sizing (ZK nodes will simply become controller nodes).  And it will be possible to do with a rolling upgrade.  I agree that some people may be nervous about running the new software, and we may want to have more point releases of the older branches.
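
The upgrade constraint being discussed here can be made concrete with a minimal sketch (this is not Kafka code, and the version numbers are hypothetical placeholders, since the thread below notes that the actual bridge version is still undecided):

```python
# Sketch of the "bridge release" upgrade constraint: a rolling upgrade into a
# ZK-less release is only valid if the previous release was a bridge release
# (or already ZK-less).  Version numbers are invented for illustration.

BRIDGE_RELEASES = {"3.9"}          # hypothetical bridge release(s), still require ZK
ZKLESS_RELEASES = {"4.0", "4.1"}   # hypothetical post-ZooKeeper releases

def upgrade_path_valid(path):
    """True if every step into a ZK-less release starts from a bridge release."""
    for prev, nxt in zip(path, path[1:]):
        if nxt in ZKLESS_RELEASES and prev not in BRIDGE_RELEASES | ZKLESS_RELEASES:
            return False
    return True
```

For example, under these assumptions a path that skips the bridge release (2.8 directly to 4.0) would be rejected, while 2.8 → 3.9 → 4.0 would be allowed.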

This is something that we'll discuss when people propose release schedules.  In general, this isn't fundamentally different than someone wanting a new release of 1.x because they don't want to upgrade to 2.x.  If there's enough interest, we'll do it.

best,
Colin

> 
> > On Aug 26, 2019, at 6:55 PM, Colin McCabe <cmccabe@apache.org> wrote:
> > 
> > Hi Ryanne,
> > 
> > Good point.  I added a section titled "future work" with information about the follow-on KIPs that we discussed here.
> > 
> > best,
> > Colin
> > 
> > 
> >> On Fri, Aug 23, 2019, at 13:15, Ryanne Dolan wrote:
> >> Thanks Colin, sgtm. Please make this clear in the KIP -- otherwise it is
> >> hard to nail down what we are voting for.
> >> 
> >> Ryanne
> >> 
> >> 
> >>> On Fri, Aug 23, 2019, 12:58 PM Colin McCabe <cmccabe@apache.org> wrote:
> >>> 
> >>>> On Fri, Aug 23, 2019, at 06:24, Ryanne Dolan wrote:
> >>>> Colin, can you outline what specifically would be in scope for this KIP
> >>> vs
> >>>> deferred to the follow-on KIPs you've mentioned? Maybe a Future Work
> >>>> section? Is the idea to get to the bridge release with this KIP, and then
> >>>> go from there?
> >>>> 
> >>>> Ryanne
> >>>> 
> >>> 
> >>> Hi Ryanne,
> >>> 
> >>> The goal for KIP-500 is to set out an overall vision for how we will
> >>> remove ZooKeeper and transition to managing metadata via a controller
> >>> quorum.
> >>> 
> >>> We will create follow-on KIPs that will lay out the specific details of
> >>> each step.
> >>> 
> >>> * A KIP for allowing kafka-configs.sh to change topic configurations
> >>> without using ZooKeeper.  (It can already change broker configurations
> >>> without ZK)
> >>> 
> >>> * A KIP for adding APIs to replace direct ZK access by the brokers.
> >>> 
> >>> * A KIP to describe Raft replication in Kafka, including the overall
> >>> protocol, details of each RPC, etc.
> >>> 
> >>> * A KIP describing the controller changes, how metadata is stored, etc.
> >>> 
> >>> There may be other KIPs that we need (for example, if we find another tool
> >>> that still has a hard ZK dependency), but that's the general idea.  KIP-500
> >>> is about the overall design-- the follow-on KIPs are about the specific
> >>> details.
> >>> 
> >>> best,
> >>> Colin
> >>> 
> >>> 
> >>>> 
> >>>>> On Thu, Aug 22, 2019, 11:58 AM Colin McCabe <cmccabe@apache.org> wrote:
> >>>>> 
> >>>>>> On Wed, Aug 21, 2019, at 19:48, Ron Dagostino wrote:
> >>>>>> Thanks, Colin.  The changes you made to the KIP related to the bridge
> >>>>>> release help make it clearer.  I still have some confusion about the
> >>>>> phrase
> >>>>>> "The rolling upgrade from the bridge release will take several
> >>> steps."
> >>>>>> This made me think you are talking about moving from the bridge
> >>> release
> >>>>> to
> >>>>>> some other, newer, release that comes after the bridge release.  But
> >>> I
> >>>>>> think what you are getting at is that the bridge release can be run
> >>> with
> >>>>> or
> >>>>>> without Zookeeper -- when first upgrading to it Zookeeper remains in
> >>> use,
> >>>>>> but then there is a transition that can be made to engage the warp
> >>>>> drive...
> >>>>>> I mean the Controller Quorum.  So maybe the phrase should be "The
> >>> rolling
> >>>>>> upgrade through the bridge release -- starting with Zookeeper being
> >>> in
> >>>>> use
> >>>>>> and ending with Zookeeper having been replaced by the Controller
> >>> Quorum
> >>>>> --
> >>>>>> will take several steps."
> >>>>> 
> >>>>> Hi Ron,
> >>>>> 
> >>>>> To clarify, the bridge release will require ZooKeeper.  It will also
> >>> not
> >>>>> support the controller quorum.  It's a bridge in the sense that you
> >>> must
> >>>>> upgrade to a bridge release prior to upgrading to a ZK-less release.  I
> >>>>> added some more descriptive text to the bridge release paragraph--
> >>>>> hopefully this makes it clearer.
> >>>>> 
> >>>>> best,
> >>>>> Colin
> >>>>> 
> >>>>>> 
> >>>>>> Do I understand it correctly, and might some change in phrasing or
> >>>>>> additional clarification help others avoid the same confusion I had?
> >>>>>> 
> >>>>>> Ron
> >>>>>> 
> >>>>>> On Wed, Aug 21, 2019 at 2:31 PM Colin McCabe <cmccabe@apache.org>
> >>> wrote:
> >>>>>> 
> >>>>>>>> On Wed, Aug 21, 2019, at 04:22, Ron Dagostino wrote:
> >>>>>>>> Hi Colin.  I like the concept of a "bridge release" for migrating
> >>>>> off of
> >>>>>>>> Zookeeper, but I worry that it may become a bottleneck if people
> >>>>> hesitate
> >>>>>>>> to replace Zookeeper -- they would be unable to adopt newer
> >>> versions
> >>>>> of
> >>>>>>>> Kafka until taking (what feels to them like) a giant leap.  As an
> >>>>>>> example,
> >>>>>>>> assuming version 4.0.x of Kafka is the supported bridge release,
> >>> I
> >>>>> would
> >>>>>>>> not be surprised if uptake of the 4.x release and the time-based
> >>>>> releases
> >>>>>>>> that follow it end up being much slower due to the perceived
> >>> barrier.
> >>>>>>>> 
> >>>>>>>> Any perceived barrier could be lowered if the 4.0.x release could
> >>>>>>>> optionally continue to use Zookeeper -- then the cutover would
> >>> be two
> >>>>>>>> incremental steps (move to 4.0.x, then replace Zookeeper while
> >>>>> staying on
> >>>>>>>> 4.0.x) as opposed to a single big-bang (upgrade to 4.0.x and
> >>> replace
> >>>>>>>> Zookeeper in one fell swoop).
> >>>>>>> 
> >>>>>>> Hi Ron,
> >>>>>>> 
> >>>>>>> Just to clarify, the "bridge release" will continue to use
> >>> ZooKeeper.
> >>>>> It
> >>>>>>> will not support running without ZooKeeper.  It is the releases
> >>> that
> >>>>> follow
> >>>>>>> the bridge release that will remove ZooKeeper.
> >>>>>>> 
> >>>>>>> Also, it's a bit unclear whether the bridge release would be 3.x or
> >>>>> 4.x,
> >>>>>>> or something to follow.  We do know that the bridge release can't
> >>> be a
> >>>>> 2.x
> >>>>>>> release, since it requires at least one incompatible change,
> >>> removing
> >>>>>>> --zookeeper options from all the shell scripts.  (Since we're doing
> >>>>>>> semantic versioning, any time we make an incompatible change, we
> >>> bump
> >>>>> the
> >>>>>>> major version number.)
> >>>>>>> 
> >>>>>>> In general, using two sources of metadata is a lot more complex and
> >>>>>>> error-prone than one.  A lot of the bugs and corner cases we have
> >>> are
> >>>>> the
> >>>>>>> result of divergences between the controller and the state in
> >>>>> ZooKeeper.
> >>>>>>> Eliminating this divergence, and the split-brain scenarios it
> >>> creates,
> >>>>> is a
> >>>>>>> major goal of this work.
> >>>>>>> 
> >>>>>>>> 
> >>>>>>>> Regardless of whether what I wrote above has merit or not, I
> >>> think
> >>>>> the
> >>>>>>> KIP
> >>>>>>>> should be more explicit about what the upgrade constraints
> >>> actually
> >>>>> are.
> >>>>>>>> Can the bridge release be adopted with Zookeeper remaining in
> >>> place
> >>>>> and
> >>>>>>>> then cutting over as a second, follow-on step, or must the
> >>> Controller
> >>>>>>>> Quorum nodes be started first and the bridge release cannot be
> >>> used
> >>>>> with
> >>>>>>>> Zookeeper at all?
> >>>>>>> 
> >>>>>>> As I mentioned above, the bridge release supports (indeed,
> >>> requires)
> >>>>>>> ZooKeeper.  I have added a little more text about this to KIP-500
> >>> which
> >>>>>>> hopefully makes it clearer.
> >>>>>>> 
> >>>>>>> best,
> >>>>>>> Colin
> >>>>>>> 
> >>>>>>>> If the bridge release cannot be used with Zookeeper at
> >>>>>>>> all, then no version at or beyond the bridge release is available
> >>>>>>>> unless/until abandoning Zookeeper; if the bridge release can be
> >>> used
> >>>>> with
> >>>>>>>> Zookeeper, then is it the only version that can be used with
> >>>>> Zookeeper,
> >>>>>>> or
> >>>>>>>> can Zookeeper be kept for additional releases if desired?
> >>>>>>>> 
> >>>>>>>> Ron
> >>>>>>>> 
> >>>>>>>> On Tue, Aug 20, 2019 at 10:19 AM Ron Dagostino <
> >>> rndgstn@gmail.com>
> >>>>>>> wrote:
> >>>>>>>> 
> >>>>>>>>> Hi Colin.  The diagram up at the top confused me --
> >>> specifically,
> >>>>> the
> >>>>>>>>> lines connecting the controller/active-controller to the
> >>> brokers.
> >>>>> I
> >>>>>>> had
> >>>>>>>>> assumed the arrows on those lines represented the direction of
> >>> data
> >>>>>>> flow,
> >>>>>>>>> but that is not the case; the arrows actually identify the
> >>> target
> >>>>> of
> >>>>>>> the
> >>>>>>>>> action, and the non-arrowed end indicates the initiator of the
> >>>>>>> action.  For
> >>>>>>>>> example, the lines point from the controller to the brokers in
> >>> the
> >>>>>>> "today"
> >>>>>>>>> section on the left to show that the controller pushes to the
> >>>>> brokers;
> >>>>>>> the
> >>>>>>>>> lines point from the brokers to the active-controller in the
> >>>>> "tomorrow"
> >>>>>>>>> section on the right to show that the brokers pull from the
> >>>>>>>>> active-controller.  As I said, this confused me because my gut
> >>>>>>> instinct was
> >>>>>>>>> to interpret the arrow as indicating the direction of data
> >>> flow,
> >>>>> and
> >>>>>>> when I
> >>>>>>>>> look at the "tomorrow" picture on the right I initially thought
> >>>>>>> information
> >>>>>>>>> was moving from the brokers to the active-controller.  Did you
> >>>>> consider
> >>>>>>>>> drawing that picture with the arrows reversed in the "tomorrow"
> >>>>> side so
> >>>>>>>>> that the arrows represent the direction of data flow, and then
> >>> add
> >>>>> the
> >>>>>>>>> labels "push" on the "today" side and "pull" on the "tomorrow"
> >>>>> side to
> >>>>>>>>> indicate who initiates the data flow?  It occurs to me that
> >>> this
> >>>>>>> picture
> >>>>>>>>> may end up being widely distributed, so it might be in
> >>> everyone's
> >>>>>>> interest
> >>>>>>>>> to proactively avoid any possible confusion by being more
> >>> explicit.
> >>>>>>>>> 
> >>>>>>>>> Minor corrections?
> >>>>>>>>> <<<In the current world, a broker which can contact ZooKeeper
> >>> but
> >>>>> which
> >>>>>>>>> is partitioned from the active controller
> >>>>>>>>>>>> In the current world, a broker which can contact ZooKeeper
> >>> but
> >>>>> which
> >>>>>>>>> is partitioned from the controller
> >>>>>>>>> 
> >>>>>>>>> <<<Eventually, the controller will ask the broker to finally go
> >>>>> offline
> >>>>>>>>>>>> Eventually, the active controller will ask the broker to
> >>>>> finally go
> >>>>>>>>> offline
> >>>>>>>>> 
> >>>>>>>>> <<<New versions of the clients should send these operations
> >>>>> directly to
> >>>>>>>>> the controller
> >>>>>>>>>>>> New versions of the clients should send these operations
> >>>>> directly to
> >>>>>>>>> the active controller
> >>>>>>>>> 
> >>>>>>>>> <<<In the post-ZK world, the leader will make an RPC to the
> >>>>> controller
> >>>>>>>>> instead
> >>>>>>>>>>>> In the post-ZK world, the leader will make an RPC to the
> >>> active
> >>>>>>>>> controller instead
> >>>>>>>>> 
> >>>>>>>>> <<<For example, the brokers may need to forward their requests
> >>> to
> >>>>> the
> >>>>>>>>> controller.
> >>>>>>>>>>>> For example, the brokers may need to forward their requests
> >>> to
> >>>>> the
> >>>>>>>>> active controller.
> >>>>>>>>> 
> >>>>>>>>> <<<The new controller will monitor ZooKeeper for legacy broker
> >>> node
> >>>>>>>>> registrations
> >>>>>>>>>>>> The new (active) controller will monitor ZooKeeper for
> >>> legacy
> >>>>> broker
> >>>>>>>>> node registrations
> >>>>>>>>> 
> >>>>>>>>> Ron
> >>>>>>>>> 
> >>>>>>>>> On Mon, Aug 19, 2019 at 6:53 PM Colin McCabe <
> >>> cmccabe@apache.org>
> >>>>>>> wrote:
> >>>>>>>>> 
> >>>>>>>>>> Hi all,
> >>>>>>>>>> 
> >>>>>>>>>> The KIP has been out for a while, so I'm thinking about
> >>> calling a
> >>>>> vote
> >>>>>>>>>> some time this week.
> >>>>>>>>>> 
> >>>>>>>>>> best,
> >>>>>>>>>> Colin
> >>>>>>>>>> 
> >>>>>>>>>>> On Mon, Aug 19, 2019, at 15:52, Colin McCabe wrote:
> >>>>>>>>>>>> On Mon, Aug 19, 2019, at 12:52, David Arthur wrote:
> >>>>>>>>>>>> Thanks for the KIP, Colin. This looks great!
> >>>>>>>>>>>> 
> >>>>>>>>>>>> I really like the idea of separating the Controller and
> >>> Broker
> >>>>>>> JVMs.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> As you alluded to above, it might be nice to have a
> >>> separate
> >>>>>>>>>>>> broker-registration API to avoid overloading the metadata
> >>>>> fetch
> >>>>>>> API.
> >>>>>>>>>>>> 
> >>>>>>>>>>> 
> >>>>>>>>>>> Hi David,
> >>>>>>>>>>> 
> >>>>>>>>>>> Thanks for taking a look.
> >>>>>>>>>>> 
> >>>>>>>>>>> I removed the sentence about MetadataFetch also serving as
> >>> the
> >>>>>>> broker
> >>>>>>>>>>> registration API.  I think I agree that we will probably
> >>> want a
> >>>>>>>>>>> separate RPC to fill this role.  We will have a follow-on
> >>> KIP
> >>>>> that
> >>>>>>> will
> >>>>>>>>>>> go into more detail about metadata propagation and
> >>> registration
> >>>>> in
> >>>>>>> the
> >>>>>>>>>>> post-ZK world.  That KIP will also have a full description
> >>> of
> >>>>> the
> >>>>>>>>>>> registration RPC, etc.  For now, I think the important part
> >>> for
> >>>>>>> KIP-500
> >>>>>>>>>>> is that the broker registers with the controller quorum.  On
> >>>>>>>>>>> registration, the controller quorum assigns it a new broker
> >>>>> epoch,
> >>>>>>>>>>> which can distinguish successive broker incarnations.
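
A minimal sketch of the registration idea described above (not Kafka code; the class and method names are invented): the controller quorum hands each registering broker a new, monotonically increasing broker epoch, so a request carrying an older epoch can be recognized as coming from a stale incarnation of the same broker id.

```python
# Sketch: broker registration with monotonically increasing broker epochs.

class ControllerQuorum:
    def __init__(self):
        self._next_epoch = 0
        self._current = {}   # broker id -> latest epoch granted

    def register_broker(self, broker_id):
        # Each registration, including a restart of the same broker id,
        # gets a strictly newer epoch.
        self._next_epoch += 1
        self._current[broker_id] = self._next_epoch
        return self._next_epoch

    def is_current_incarnation(self, broker_id, epoch):
        # A mismatched epoch means the request came from an old incarnation.
        return self._current.get(broker_id) == epoch
```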
> >>>>>>>>>>> 
> >>>>>>>>>>>> 
> >>>>>>>>>>>> When a broker gets a metadata delta, will it be a
> >>> sequence of
> >>>>>>> deltas
> >>>>>>>>>> since
> >>>>>>>>>>>> the last update or a cumulative delta since the last
> >>> update?
> >>>>>>>>>>>> 
> >>>>>>>>>>> 
> >>>>>>>>>>> It will be a sequence of deltas.  Basically, the broker
> >>> will be
> >>>>>>> reading
> >>>>>>>>>>> from the metadata log.
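
The delta-consumption model above can be sketched as follows (not Kafka code; the record shapes are invented purely for illustration): the broker tracks the offset of the last record it applied, so each fetch resumes where the previous one left off, like a follower fetch.

```python
# Sketch: a broker applying a sequence of deltas read from the metadata log.

class BrokerMetadataCache:
    def __init__(self):
        self.topics = {}        # topic name -> partition count
        self.last_offset = -1   # offset of the last delta applied

    def apply_from(self, log):
        # Apply only the deltas past the last applied offset.
        for offset in range(self.last_offset + 1, len(log)):
            op, topic, partitions = log[offset]
            if op == "create":
                self.topics[topic] = partitions
            elif op == "delete":
                self.topics.pop(topic, None)
            self.last_offset = offset
```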
> >>>>>>>>>>> 
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Will we include any kind of integrity check on the deltas
> >>> to
> >>>>>>> ensure
> >>>>>>>>>> the brokers
> >>>>>>>>>>>> have applied them correctly? Perhaps this will be
> >>> addressed in
> >>>>>>> one of
> >>>>>>>>>> the
> >>>>>>>>>>>> follow-on KIPs.
> >>>>>>>>>>>> 
> >>>>>>>>>>> 
> >>>>>>>>>>> In general, we will have checksums on the metadata that we
> >>>>> fetch.
> >>>>>>> This
> >>>>>>>>>>> is similar to how we have checksums on regular data.  Or if
> >>> the
> >>>>>>>>>>> question is about catching logic errors in the metadata
> >>> handling
> >>>>>>> code,
> >>>>>>>>>>> that sounds more like something that should be caught by
> >>> test
> >>>>> cases.
> >>>>>>>>>>> 
> >>>>>>>>>>> best,
> >>>>>>>>>>> Colin
> >>>>>>>>>>> 
> >>>>>>>>>>> 
> >>>>>>>>>>>> Thanks!
> >>>>>>>>>>>> 
> >>>>>>>>>>>> On Fri, Aug 9, 2019 at 1:17 PM Colin McCabe <
> >>>>> cmccabe@apache.org>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>> 
> >>>>>>>>>>>>> Hi Mickael,
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> Thanks for taking a look.
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> I don't think we want to support that kind of
> >>> multi-tenancy
> >>>>> at
> >>>>>>> the
> >>>>>>>>>>>>> controller level.  If the cluster is small enough that
> >>> we
> >>>>> want
> >>>>>>> to
> >>>>>>>>>> pack the
> >>>>>>>>>>>>> controller(s) with something else, we could run them
> >>>>> alongside
> >>>>>>> the
> >>>>>>>>>> brokers,
> >>>>>>>>>>>>> or possibly inside three of the broker JVMs.
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> best,
> >>>>>>>>>>>>> Colin
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>>> On Wed, Aug 7, 2019, at 10:37, Mickael Maison wrote:
> >>>>>>>>>>>>>> Thank Colin for kickstarting this initiative.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Just one question.
> >>>>>>>>>>>>>> - A nice feature of Zookeeper is the ability to use
> >>>>> chroots
> >>>>>>> and
> >>>>>>>>>> have
> >>>>>>>>>>>>>> several Kafka clusters use the same Zookeeper
> >>> ensemble. Is
> >>>>>>> this
> >>>>>>>>>>>>>> something we should keep?
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> On Mon, Aug 5, 2019 at 7:44 PM Colin McCabe <
> >>>>>>> cmccabe@apache.org>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> On Mon, Aug 5, 2019, at 10:02, Tom Bentley wrote:
> >>>>>>>>>>>>>>>> Hi Colin,
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> Thanks for the KIP.
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> Currently ZooKeeper provides a convenient
> >>> notification
> >>>>>>>>>> mechanism for
> >>>>>>>>>>>>>>>> knowing that broker and topic configuration has
> >>>>> changed.
> >>>>>>> While
> >>>>>>>>>>>>> KIP-500 does
> >>>>>>>>>>>>>>>> suggest that incremental metadata update is
> >>> expected
> >>>>> to
> >>>>>>> come
> >>>>>>>>>> to
> >>>>>>>>>>>>> clients
> >>>>>>>>>>>>>>>> eventually, that would seem to imply that for some
> >>>>> number
> >>>>>>> of
> >>>>>>>>>>>>> releases there
> >>>>>>>>>>>>>>>> would be no equivalent mechanism for knowing about
> >>>>> config
> >>>>>>>>>> changes.
> >>>>>>>>>>>>> Is there
> >>>>>>>>>>>>>>>> any thinking at this point about how a similar
> >>>>>>> notification
> >>>>>>>>>> might be
> >>>>>>>>>>>>>>>> provided in the future?
> >>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>> We could eventually have some inotify-like mechanism
> >>>>> where
> >>>>>>>>>> clients
> >>>>>>>>>>>>> could register interest in various types of events and
> >>> got
> >>>>>>> notified
> >>>>>>>>>> when
> >>>>>>>>>>>>> they happened.  Reading the metadata log is conceptually
> >>>>> simple.
> >>>>>>>>>> The main
> >>>>>>>>>>>>> complexity would be in setting up an API that made
> >>> sense and
> >>>>>>> that
> >>>>>>>>>> didn't
> >>>>>>>>>>>>> unduly constrain future implementations.  We'd have to
> >>> think
> >>>>>>>>>> carefully
> >>>>>>>>>>>>> about what the real use-cases for this were, though.
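
The inotify-like mechanism floated above could look roughly like this sketch (not Kafka code; the API names are invented): clients register interest in event types, and callbacks fire as matching records are read from the metadata log.

```python
# Sketch: clients register interest in metadata event types and are
# notified as matching records are processed.

class MetadataWatcher:
    def __init__(self):
        self._listeners = {}    # event type -> list of callbacks

    def watch(self, event_type, callback):
        self._listeners.setdefault(event_type, []).append(callback)

    def process(self, records):
        # records: (event type, payload) pairs read from the metadata log.
        for event_type, payload in records:
            for cb in self._listeners.get(event_type, []):
                cb(payload)
```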
> >>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>> best,
> >>>>>>>>>>>>>>> Colin
> >>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> Tom
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> On Mon, Aug 5, 2019 at 3:49 PM Viktor
> >>> Somogyi-Vass <
> >>>>>>>>>>>>> viktorsomogyi@gmail.com>
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>> Hey Colin,
> >>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>> I think this is a long-awaited KIP, thanks for
> >>>>> driving
> >>>>>>> it.
> >>>>>>>>>> I'm
> >>>>>>>>>>>>> excited to
> >>>>>>>>>>>>>>>>> see this in Kafka once. I collected my questions
> >>>>> (and I
> >>>>>>>>>> accept the
> >>>>>>>>>>>>> "TBD"
> >>>>>>>>>>>>>>>>> answer as they might be a bit deep for this high
> >>>>> level
> >>>>>>> :) ).
> >>>>>>>>>>>>>>>>> 1.) Are there any specific reasons for the
> >>>>> Controller
> >>>>>>> just
> >>>>>>>>>>>>> periodically
> >>>>>>>>>>>>>>>>> persisting its state on disk periodically
> >>> instead of
> >>>>>>>>>>>>> asynchronously with
> >>>>>>>>>>>>>>>>> every update? Wouldn't less frequent saves
> >>> increase
> >>>>> the
> >>>>>>>>>> chance for
> >>>>>>>>>>>>> missing
> >>>>>>>>>>>>>>>>> a state change if the controller crashes
> >>> between two
> >>>>>>> saves?
> >>>>>>>>>>>>>>>>> 2.) Why can't we allow brokers to fetch metadata
> >>>>> from
> >>>>>>> the
> >>>>>>>>>> follower
> >>>>>>>>>>>>>>>>> controllers? I assume that followers would have
> >>>>>>> up-to-date
> >>>>>>>>>>>>> information
> >>>>>>>>>>>>>>>>> therefore brokers could fetch from there in
> >>> theory.
> >>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>> Viktor
> >>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>> On Sun, Aug 4, 2019 at 6:58 AM Boyang Chen <
> >>>>>>>>>>>>> reluctanthero104@gmail.com>
> >>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>>> Thanks for explaining Ismael! Breaking down
> >>> into
> >>>>>>>>>> follow-up KIPs
> >>>>>>>>>>>>> sounds
> >>>>>>>>>>>>>>>>> like
> >>>>>>>>>>>>>>>>>> a good idea.
> >>>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>>> On Sat, Aug 3, 2019 at 10:14 AM Ismael Juma <
> >>>>>>>>>> ismael@juma.me.uk>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>>>> Hi Boyang,
> >>>>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>>>> Yes, there will be several KIPs that will
> >>>>> discuss
> >>>>>>> the
> >>>>>>>>>> items you
> >>>>>>>>>>>>>>>>> describe
> >>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>> detail. Colin, it may be helpful to make
> >>> this
> >>>>> clear
> >>>>>>> in
> >>>>>>>>>> the KIP
> >>>>>>>>>>>>> 500
> >>>>>>>>>>>>>>>>>>> description.
> >>>>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>>>> Ismael
> >>>>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>>>> On Sat, Aug 3, 2019 at 9:32 AM Boyang Chen <
> >>>>>>>>>>>>> reluctanthero104@gmail.com
> >>>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>>>>> Thanks Colin for initiating this important
> >>>>> effort!
> >>>>>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>>>>> One question I have is whether we have a
> >>>>> session
> >>>>>>>>>> discussing
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> controller
> >>>>>>>>>>>>>>>>>>>> failover in the new architecture? I know
> >>> we
> >>>>> are
> >>>>>>> using
> >>>>>>>>>> Raft
> >>>>>>>>>>>>> protocol
> >>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>> failover, yet it's still valuable to
> >>> discuss
> >>>>> the
> >>>>>>>>>> steps new
> >>>>>>>>>>>>> cluster is
> >>>>>>>>>>>>>>>>>>> going
> >>>>>>>>>>>>>>>>>>>> to take to reach the stable stage again,
> >>> so
> >>>>> that
> >>>>>>> we
> >>>>>>>>>> could
> >>>>>>>>>>>>> easily
> >>>>>>>>>>>>>>>>>> measure
> >>>>>>>>>>>>>>>>>>>> the availability of the metadata servers.
> >>>>>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>>>>> Another suggestion I have is to write a
> >>>>>>> step-by-step
> >>>>>>>>>> design
> >>>>>>>>>>>>> doc like
> >>>>>>>>>>>>>>>>>> what
> >>>>>>>>>>>>>>>>>>>> we did in KIP-98
> >>>>>>>>>>>>>>>>>>>> <
> >>>>>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>> 
> >>>>>>>>>> 
> >>>>>>> 
> >>>>> 
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging
> >>>>>>>>>>>>>>>>>>>>> ,
> >>>>>>>>>>>>>>>>>>>> including the new request protocols and
> >>> how
> >>>>> they
> >>>>>>> are
> >>>>>>>>>>>>> interacting in
> >>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> new
> >>>>>>>>>>>>>>>>>>>> cluster. For a complicated change like
> >>> this,
> >>>>> an
> >>>>>>>>>>>>> implementation design
> >>>>>>>>>>>>>>>>>> doc
> >>>>>>>>>>>>>>>>>>>> help a lot in the review process,
> >>> otherwise
> >>>>> most
> >>>>>>>>>> discussions
> >>>>>>>>>>>>> we have
> >>>>>>>>>>>>>>>>>> will
> >>>>>>>>>>>>>>>>>>>> focus on high level and lose important
> >>>>> details as
> >>>>>>> we
> >>>>>>>>>>>>> discover them in
> >>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>> post-agreement phase.
> >>>>>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>>>>> Boyang
> >>>>>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>>>>> On Fri, Aug 2, 2019 at 5:17 PM Colin
> >>> McCabe <
> >>>>>>>>>>>>> cmccabe@apache.org>
> >>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>>>>>> On Fri, Aug 2, 2019, at 16:33, Jose
> >>> Armando
> >>>>>>> Garcia
> >>>>>>>>>> Sancio
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>> Thanks Colin for the detail KIP. I
> >>> have a
> >>>>> few
> >>>>>>>>>> comments
> >>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>> questions.
> >>>>>>>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>>>>>>> In the KIP's Motivation and Overview
> >>> you
> >>>>>>>>>> mentioned the
> >>>>>>>>>>>>>>>>> LeaderAndIsr
> >>>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>>>> UpdateMetadata RPC. For example,
> >>> "updates
> >>>>>>> which
> >>>>>>>>>> the
> >>>>>>>>>>>>> controller
> >>>>>>>>>>>>>>>>>>> pushes,
> >>>>>>>>>>>>>>>>>>>>> such
> >>>>>>>>>>>>>>>>>>>>>> as LeaderAndIsr and UpdateMetadata
> >>>>> messages".
> >>>>>>> Is
> >>>>>>>>>> your
> >>>>>>>>>>>>> thinking
> >>>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>>>>>>> will
> >>>>>>>>>>>>>>>>>>>>>> use MetadataFetch as a replacement to
> >>> just
> >>>>>>>>>>>>> UpdateMetadata only
> >>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>> add
> >>>>>>>>>>>>>>>>>>>>>> topic configuration in this state?
> >>>>>>>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>>>>>> Hi Jose,
> >>>>>>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>>>>>> Thanks for taking a look.
> >>>>>>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>>>>>> The goal is for MetadataFetchRequest to
> >>>>> replace
> >>>>>>> both
> >>>>>>>>>>>>>>>>>>> LeaderAndIsrRequest
> >>>>>>>>>>>>>>>>>>>>> and UpdateMetadataRequest.  Topic
> >>>>> configurations
> >>>>>>>>>> would be
> >>>>>>>>>>>>> fetched
> >>>>>>>>>>>>>>>>>> along
> >>>>>>>>>>>>>>>>>>>>> with the other metadata.
> >>>>>>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>>>>>>> In the section "Broker Metadata
> >>>>> Management",
> >>>>>>> you
> >>>>>>>>>> mention
> >>>>>>>>>>>>> "Just
> >>>>>>>>>>>>>>>>> like
> >>>>>>>>>>>>>>>>>>>> with
> >>>>>>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>>>> fetch request, the broker will track
> >>> the
> >>>>>>> offset
> >>>>>>>>>> of the
> >>>>>>>>>>>>> last
> >>>>>>>>>>>>>>>>> updates
> >>>>>>>>>>>>>>>>>>> it
> >>>>>>>>>>>>>>>>>>>>>> fetched". To keep the log consistent
> >>> Raft
> >>>>>>>>>> requires that
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> followers
> >>>>>>>>>>>>>>>>>>>>> keep
> >>>>>>>>>>>>>>>>>>>>>> all of the log entries (term/epoch and
> >>>>> offset)
> >>>>>>>>>> that are
> >>>>>>>>>>>>> after the
> >>>>>>>>>>>>>>>>>>>>>> highwatermark. Any log entry before
> >>> the
> >>>>>>>>>> highwatermark
> >>>>>>>>>>>>> can be
> >>>>>>>>>>>>>>>>>>>>>> compacted/snapshot. Do we expect the
> >>>>>>>>>> MetadataFetch API
> >>>>>>>>>>>>> to only
> >>>>>>>>>>>>>>>>>> return
> >>>>>>>>>>>>>>>>>>>> log
> >>>>>>>>>>>>>>>>>>>>>> entries up to the highwatermark?
> >>> Unlike
> >>>>> the
> >>>>>>> Raft
> >>>>>>>>>>>>> replication API
> >>>>>>>>>>>>>>>>>>> which
> >>>>>>>>>>>>>>>>>>>>>> will replicate/fetch log entries
> >>> after the
> >>>>>>>>>> highwatermark
> >>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>> consensus?
> >>>>>>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>>>>>> Good question.  Clearly, we shouldn't
> >>> expose
> >>>>>>>>>> metadata
> >>>>>>>>>>>>> updates to
> >>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>> brokers until they've been stored on a
> >>>>> majority
> >>>>>>> of
> >>>>>>>>>> the
> >>>>>>>>>>>>> Raft nodes.
> >>>>>>>>>>>>>>>>>> The
> >>>>>>>>>>>>>>>>>>>>> most obvious way to do that, like you
> >>>>>>> mentioned, is
> >>>>>>>>>> to
> >>>>>>>>>>>>> have the
> >>>>>>>>>>>>>>>>>> brokers
> >>>>>>>>>>>>>>>>>>>>> only fetch up to the HWM, but not
> >>> beyond.
> >>>>> There
> >>>>>>>>>> might be
> >>>>>>>>>>>>> a more
> >>>>>>>>>>>>>>>>>> clever
> >>>>>>>>>>>>>>>>>>>> way
> >>>>>>>>>>>>>>>>>>>>> to do it by fetching the data, but not
> >>>>> having
> >>>>>>> the
> >>>>>>>>>> brokers
> >>>>>>>>>>>>> act on it
> >>>>>>>>>>>>>>>>>>> until
> >>>>>>>>>>>>>>>>>>>>> the HWM advances.  I'm not sure if
> >>> that's
> >>>>> worth
> >>>>>>> it
> >>>>>>>>>> or
> >>>>>>>>>>>>> not.  We'll
> >>>>>>>>>>>>>>>>>>> discuss
> >>>>>>>>>>>>>>>>>>>>> this more in a separate KIP that
> >>>>> discusses
> >>>>>>>>>> just Raft.
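
The "fetch only up to the HWM" option can be sketched like this (not Kafka code; function names are invented): the high watermark is the largest offset replicated on a majority of the quorum, and brokers see only the committed entries below it.

```python
# Sketch: computing a high watermark from quorum log end offsets, and
# restricting broker-visible metadata to committed entries.

def high_watermark(voter_end_offsets):
    """voter_end_offsets: log end offset of every quorum member, leader included."""
    offsets = sorted(voter_end_offsets, reverse=True)
    majority = len(offsets) // 2 + 1
    # The majority-th highest offset is replicated on a majority of voters.
    return offsets[majority - 1]

def broker_visible(log, hwm):
    # Entries at or past the HWM are not yet committed, so brokers don't get them.
    return log[:hwm]
```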
> >>>>>>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>>>>>>> In section "Broker Metadata
> >>> Management",
> >>>>> you
> >>>>>>>>>> mention "the
> >>>>>>>>>>>>>>>>>> controller
> >>>>>>>>>>>>>>>>>>>> will
> >>>>>>>>>>>>>>>>>>>>>> send a full metadata image rather
> >>> than a
> >>>>>>> series of
> >>>>>>>>>>>>> deltas". This
> >>>>>>>>>>>>>>>>>> KIP
> >>>>>>>>>>>>>>>>>>>>>> doesn't go into the set of operations
> >>> that
> >>>>>>> need
> >>>>>>>>>> to be
> >>>>>>>>>>>>> supported
> >>>>>>>>>>>>>>>>> on
> >>>>>>>>>>>>>>>>>>> top
> >>>>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>>> Raft but it would be interested if
> >>> this
> >>>>> "full
> >>>>>>>>>> metadata
> >>>>>>>>>>>>> image"
> >>>>>>>>>>>>>>>>> could
> >>>>>>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>>>>>> express also as deltas. For example,
> >>>>> assuming
> >>>>>>> we
> >>>>>>>>>> are
> >>>>>>>>>>>>> replicating
> >>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>> map
> >>>>>>>>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>>>>> "full metadata image" could be a
> >>> sequence
> >>>>> of
> >>>>>>> "put"
> >>>>>>>>>>>>> operations
> >>>>>>>>>>>>>>>>>> (znode
> >>>>>>>>>>>>>>>>>>>>> create
> >>>>>>>>>>>>>>>>>>>>>> to borrow ZK semantics).
> >>>>>>>>>>>>>>>>>>>>> 
> The full image can definitely be expressed as a sum of deltas.  At
> some point, the number of deltas will get large enough that sending a
> full image is better, though.  One question that we're still thinking
> about is how much of this can be shared with generic Kafka log code,
> and how much should be different.
> 
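The equivalence the two are discussing (a full image is just a special sequence of deltas) can be made concrete with a small sketch. This is illustrative only, assuming metadata modeled as a replicated map of "put" operations; none of these function names come from the KIP.

```python
# Hypothetical sketch: metadata as a replicated map, changes shipped as
# ("put", key, value) deltas.  A full image is one put per key, so the
# sender can choose whichever encoding carries fewer operations.
def apply_deltas(image, deltas):
    """Apply a sequence of put operations to a metadata map, in order."""
    for op, key, value in deltas:
        if op == "put":
            image[key] = value
    return image

def image_as_deltas(image):
    """Express a full metadata image as an equivalent sequence of puts."""
    return [("put", k, v) for k, v in image.items()]

def choose_encoding(image, pending_deltas):
    # Pick the smaller representation; a real implementation would compare
    # serialized sizes rather than operation counts.
    full = image_as_deltas(image)
    return full if len(full) < len(pending_deltas) else pending_deltas

# Two deltas touching the same key collapse into a single put in the image.
state = apply_deltas({}, [("put", "topic-a", {"partitions": 3}),
                          ("put", "topic-a", {"partitions": 6})])
```

This also shows why the full image eventually wins: a long delta history can touch the same keys repeatedly, while the image is bounded by the number of live keys.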
> > In section "Broker Metadata Management", you mention "This request
> > will double as a heartbeat, letting the controller know that the
> > broker is alive".  In section "Broker State Machine", you mention
> > "The MetadataFetch API serves as this registration mechanism".  Does
> > this mean that the MetadataFetch request will optionally include
> > broker configuration information?
> 
> I was originally thinking that the MetadataFetchRequest should
> include broker configuration information.  Thinking about this more,
> maybe we should just have a special registration RPC that contains
> that information, to avoid sending it over the wire all the time.
> 
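The split Colin is proposing can be sketched as follows: a one-time registration RPC carries the relatively large broker configuration, while the frequent MetadataFetch requests carry only a broker id and double as heartbeats. This is a hypothetical illustration, not the KIP's actual wire format or API.

```python
# Hypothetical sketch of registration vs. heartbeat-via-fetch.
class Controller:
    def __init__(self):
        self.registered = {}       # broker_id -> configuration sent at registration
        self.heartbeats = {}       # broker_id -> count of fetches seen

    def handle_register(self, broker_id, config):
        # The expensive payload travels once, at registration time.
        self.registered[broker_id] = config

    def handle_metadata_fetch(self, broker_id):
        # Require registration first (the broker must be out of Offline).
        if broker_id not in self.registered:
            raise RuntimeError("broker must register before fetching metadata")
        # The fetch itself doubles as a heartbeat; no config is re-sent.
        self.heartbeats[broker_id] = self.heartbeats.get(broker_id, 0) + 1
        return {"metadata": "..."}

ctrl = Controller()
ctrl.handle_register(1, {"rack": "a", "listeners": ["PLAINTEXT://:9092"]})
ctrl.handle_metadata_fetch(1)
```

Note that this structure also captures Colin's next answer below: a fetch from an unregistered broker is rejected, so the registration RPC must complete first.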
> > Does this also mean that a MetadataFetch request will result in a
> > "write"/AppendEntries through the Raft replication protocol before
> > you can send the associated MetadataFetch response?
> 
> I think we should require the broker to be out of the Offline state
> before allowing it to fetch metadata, yes.  So the separate
> registration RPC should have completed first.
> 
> > In section "Broker State", you mention that a broker can transition
> > to online after it is caught up with the metadata.  What do you mean
> > by this?  Metadata is always changing.  How does the broker know
> > that it is caught up, since it doesn't participate in the consensus
> > or the advancement of the high watermark?
> 
> That's a good point.  Being "caught up" is somewhat of a fuzzy
> concept here, since the brokers do not participate in the metadata
> consensus.  I think ideally we would want to define it in terms of
> time ("the broker has all the updates from the last 2 minutes", for
> example).  We should spell this out better in the KIP.
> 
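A time-based definition like the one Colin suggests could look roughly like this. The function and constant names are invented for illustration; the KIP does not specify this check.

```python
import time

# Hypothetical sketch of a time-based "caught up" definition: a broker is
# caught up if every metadata update it has not yet applied is newer than
# some staleness bound (2 minutes, matching the example in the thread).
MAX_STALENESS_SECONDS = 120

def is_caught_up(last_applied_update_time, leader_latest_update_time,
                 now=None, max_staleness=MAX_STALENESS_SECONDS):
    now = time.time() if now is None else now
    if leader_latest_update_time <= last_applied_update_time:
        return True  # the broker has applied everything the leader has
    # Otherwise, the broker is caught up only if it fell behind more
    # recently than the staleness bound.
    return now - last_applied_update_time <= max_staleness

# A broker 30 seconds behind is caught up; one 10 minutes behind is not.
fresh = is_caught_up(970, 1000, now=1000)
stale = is_caught_up(400, 1000, now=1000)
```

The appeal of a wall-clock bound here is exactly the point raised in the question: since the broker is not in the consensus, it cannot compare itself against the high watermark, but it can compare timestamps it has already seen against the current time.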
> > In section "Start the controller quorum nodes", you mention "Once it
> > has taken over the /controller node, the active controller will
> > proceed to load the full state of ZooKeeper.  It will write out this
> > information to the quorum's metadata storage.  After this point, the
> > metadata quorum will be the metadata store of record, rather than
> > the data in ZooKeeper."  During this migration, should we expect a
> > small period of controller unavailability while the controller
> > replicates this state to all of the Raft nodes in the controller
> > quorum, and should we buffer new controller API requests?
> 
> Yes, the controller would be unavailable during this time.  I don't
> think this will be that different from the current period of
> unavailability when a new controller starts up and needs to load the
> full state from ZK.  The main difference is that in this period, we'd
> have to write to the controller quorum rather than just to memory.
> But we believe this should be pretty fast.
> 
> regards,
> Colin
> 
> > Thanks!
> > -Jose
> 
> --
> David Arthur
> 
