kafka-dev mailing list archives

From Jun Rao <...@confluent.io>
Subject Re: [DISCUSS] KIP-250 Add Support for Quorum-based Producer Acknowledgment
Date Tue, 20 Mar 2018 00:39:05 GMT
Hi, Jason,

Thanks for the comment. I didn't realize that the KIP was only focusing on
the latency of the producer. From an application's perspective, if an
application is latency sensitive, it probably wants to optimize the
end-to-end latency from the producer to the consumer. Litao, would you
agree with that?

Assuming that we want to also optimize the consumer latency, a second
question is whether it's still useful to maintain the concept of ISR. ISR
may not be strictly needed since one can just elect the leader from RF - Q
+ 1 replicas (RF is replication factor and Q is the quorum size). However,
there is some benefit of maintaining ISR when Q is not the majority.
Consider RF = 4 and Q = 2. Let's say at some point only 3 replicas are
alive and we only wait for the 2 fastest replicas. Then, another replica
fails. If we don't maintain ISR, then we can't select the new leader since
we need 3 replicas for leader election. If we maintain ISR, we can select
the new leader in this case. Arguably this may not be a common use case.
However, this gives people more flexibility to optimize the latency.
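The counting argument above can be made concrete with a rough sketch (illustrative Python only, not from the KIP; the function names are mine):

```python
# Illustrative sketch only (not from the KIP): why keeping ISR helps when
# the quorum Q is smaller than a majority of the replication factor RF.

def can_elect_without_isr(rf, q, live):
    # Without ISR tracking, any Q replicas may hold an acked record, so a
    # safe leader can only be chosen by comparing RF - Q + 1 live replicas
    # (any such set must intersect the acking quorum).
    return live >= rf - q + 1

def can_elect_with_isr(live_isr):
    # With ISR tracking, any live in-sync replica is eligible to lead.
    return live_isr >= 1

# RF = 4, Q = 2: one replica already down, then a second one fails.
print(can_elect_without_isr(rf=4, q=2, live=2))  # False: 3 live replicas needed
print(can_elect_with_isr(live_isr=2))            # True: elect from the ISR
```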

If we want to keep the concept of ISR, the algorithm could be the
following. We maintain high watermark and ISR as they are today. The
producer will only receive an ack when Q replicas have received a record.
The leader will be maintaining a special high watermark (say Q-HW), which
is the min log end offset of the fastest Q in-sync replicas. In that mode,
the consumer can read records up to Q-HW. When the leader fails, the
replica in the remaining live in-sync replicas with the longest log will be
selected as the new leader. The guarantee is then that if there are fewer
than Q replica failures, there is no data loss for the consumer.
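The algorithm above can be sketched as follows (an illustrative sketch only, not Kafka code; the replica names and offsets are made up):

```python
# Illustrative sketch only (not Kafka code): the proposed Q-HW, i.e. the
# min log end offset (LEO) of the fastest Q in-sync replicas, and the
# longest-log rule for picking a new leader.

def quorum_high_watermark(leos, q):
    # The Q-th largest LEO: exactly Q in-sync replicas have replicated
    # every offset below this value, so consumers may read up to here.
    return sorted(leos, reverse=True)[q - 1]

def pick_new_leader(live_isr_leos):
    # On leader failure, choose the live in-sync replica with the longest log.
    return max(live_isr_leos, key=live_isr_leos.get)

isr_leos = {"r1": 120, "r2": 115, "r3": 100}
print(quorum_high_watermark(list(isr_leos.values()), q=2))  # 115
print(pick_new_leader(isr_leos))  # r1
```

With fewer than Q failures, at least one replica at or above Q-HW survives, which is what makes the longest-log election safe for consumers.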

A third question is whether the leader election should involve the
controller. Colin has several good points on the controller-less approach.
However, the pros for the controller approach are:
1. It batches leader election and the propagation of metadata when the
leader changes, which leads to far fewer cross-broker RPC requests during
leader election.
2. It serializes different events that may require a leader change (e.g.,
partition reassignment, controlled shutdown, preferred leader election).
This avoids conflicting decisions made at the same time and makes it easier
to reason about the sequencing of the metadata.
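For a sense of scale on point 1, here is a back-of-the-envelope count (hypothetical numbers, not a measurement):

```python
# Rough illustration (not Kafka code) of batching leader changes into one
# request per broker, versus one RPC per partition per broker.

def rpcs_per_partition(partitions, brokers):
    # naive: one leader-change RPC to every broker for every partition
    return len(partitions) * len(brokers)

def rpcs_batched(partitions, brokers):
    # controller style: one batched request per broker carrying all changes
    return len(brokers)

parts = [f"topic-{i}" for i in range(1000)]  # hypothetical failed-broker load
brokers = ["b1", "b2", "b3"]
print(rpcs_per_partition(parts, brokers))  # 3000
print(rpcs_batched(parts, brokers))        # 3
```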

Supporting the above in a controller-less approach is possible, but
probably requires more work. So, perhaps this can be revisited in the
future.

Thanks,

Jun


On Thu, Mar 8, 2018 at 8:47 AM, Jason Gustafson <jason@confluent.io> wrote:

> Initially I thought this proposal was just about improving producer
> latency. So acks=quorum (or whatever) was sort of the midway point between
> acks=1 and acks=all, but offsets would only be exposed according to current
> high watermark semantics (meaning full ISR replication). However, it looks
> like we are also trying to allow messages to be exposed to consumers as
> soon as we have achieved replication to a quorum. In this case, I don't
> think the ISR leader election makes sense any longer unless we're willing
> to accept consumers seeing uncommitted data (which seems like a bad
> regression to me). Probably the only way you can avoid it is to also
> require a quorum for leader election. But maybe that's not so bad? If a
> quorum is not present, producers wouldn't be able to make progress anyway,
> so the only advantage of allowing election from a minority of replicas is
> potentially the ability to serve old reads. That should be possible from
> any replica anyway as long as we know that the data has been committed. So
> effectively we would be giving users a 2f + 1 option. And I agree with
> Colin that if we had that, future work could investigate moving the leader
> election out of the controller.
>
> -Jason
>
> On Tue, Mar 6, 2018 at 6:45 PM, Jun Rao <jun@confluent.io> wrote:
>
> > Hi, Colin,
> >
> > A couple of thoughts on your previous comments.
> >
> > 1. I am not sure if the controller is necessarily a bottleneck for leader
> > election. In the common case, leader election is due to a single broker
> > failure. So, the controller only needs to change the leaders for the
> number
> > of partitions on that broker, not for the whole cluster. Also, the
> > controller can do the leader election for all affected partitions in a
> > batch fashion and communicate the new leader for all partitions in a
> single
> > RPC request.
> >
> > 2. Keeping the metadata such as ISR in ZK in the quorum mode has some
> > benefits. Suppose that a user wants to wait for a quorum of 2 out of 4
> > replicas. At some point, replicas A and B are in sync, and replicas C
> and D
> > are out of sync. Later on, replicas A and B fail. By checking ISR, we can
> > choose not to elect a new leader from replicas C and D to avoid unclean
> > leader election. If we don't maintain ISR, this would be hard.
> >
> > Hi, Litao,
> >
> > At the high level, what you outlined makes sense. A few more comments.
> >
> > 10. It would be useful to clarify the semantics of the new configs. With
> a
> > quorum size of Q, I guess we want to wait until Q in-sync replicas have
> > received the message before exposing it to a consumer and acknowledging
> the
> > producer if acks=quorum is specified. If the producer uses acks=0 or 1,
> it
> > seems that the current semantic applies. However, if the producer uses
> > acks=all while the topic is configured with quorum, or if the producer
> > uses acks=quorum while the topic is configured without quorum, should we
> > just reject the request?
> >
> > 11. Just having a boolean enable.quorum.acks in the topic config is not
> > enough, right? It seems that we need an integer value for the quorum
> size.
> > It
> > would be useful to see what set of configurations makes more intuitive
> sense.
> >
> > For example, one option is to add sth like "quorum.required.acks" of type
> > integer. Then, should we allow both quorum.required.acks and min.isr set
> on
> > the same topic?
> >
> > Another option is to have a boolean config such as "enable.quorum.acks",
> > which defaults to false. If it's enabled, we just reuse min.isr as the
> > quorum size.
> >
> > 12. About the fencing issue that Guozhang mentioned. This is a potential
> > issue in the quorum mode. There are a few ways that we can improve this.
> > One approach is to force some kind of regular heartbeat between the
> broker
> > and the controller. If a broker hasn't received a heartbeat from a
> > controller for some time, the broker will pause any future writes on its
> > leaders until the heartbeat is resumed. Another approach is that when
> there
> > is log divergence, we could save the data to be truncated in the follower
> > to a separate file and provide an external tool for users to retrieve
> these
> > data, if they are needed. These improvements can potentially be done in
> > future KIPs.
> >
> > Thanks,
> >
> > Jun
> >
> > On Mon, Feb 12, 2018 at 3:02 PM, Litao Deng
> <litao.deng@airbnb.com.invalid
> > >
> > wrote:
> >
> > > Folks. Thanks for all of the good discussions.
> > >
> > > Here are a few of my thoughts:
> > >
> > >    1. we won't change any existing semantics. That means, besides acks
> > '-1
> > >    (all)', '0', '1', we will introduce a separate 'quorum' and document
> > the
> > >    semantic. 'quorum' is a totally different view of our replication
> > > protocol
> > >    for the sake of better tail (P999) latency. I will advise users not
> > > to compare
> > >    'quorum' with '-1 (all)' or any other existing values.
> > >    2. in terms of the controller functionality, I admit there are many
> > >    awesome consensus protocols; however, for this specific KIP, I
> choose
> > to
> > >    minimize the impact/change on the controller code path.
> > >       - we will keep the current controller's overall design and
> > >       implementation by NOT introducing any consensus protocol.
> > >       - we will introduce a topic level config "enable.quorum.acks"
> > >       (default to false), and only accept acks='quorum' produce
> requests
> > > while
> > >       the corresponding topic has this config enabled. During the new
> > >       leader election, we will use the new LEO-based leader election
> > >       only for topics that have "enable.quorum.acks" turned on, so
> > >       other topics won't pay the penalty.
> > >
> > > One requirement for this KIP is full semantic backward compatibility and
> > > pay-as-you-go controller complexity (longer new leader election latency).
> > >
> > > Thoughts?
> > >
> > > On Mon, Feb 12, 2018 at 10:19 AM, Guozhang Wang <wangguoz@gmail.com>
> > > wrote:
> > >
> > > > Hello Tao,
> > > >
> > > > No I was not proposing to change the mechanism of acks=all, and only
> > > > mentioning that today even with acks=all the tolerance of failures is
> > > > theoretically still bounded by min.isr settings though we do require
> > all
> > > > replicas in ISR (which may be larger than min.isr) to replicate
> before
> > > > responding; this is what Jun mentioned may surprise many users
> today. I
> > > > think an additional "acks=quorum" can help resolve this, by
> > > requiring
> > > > the num.acks >= majority (to make sure consistency is guaranteed with
> > at
> > > > most (X-1) / 2 failures with X number of replicas) AND num.acks >=
> > > min.isr
> > > > (to specify if we want to tolerate more failures than (X-1) / 2).
> > > >
> > > > The question then is, whether acks=all is still useful with
> introduced
> > > > "quorum": if it is not, we can just replace the current semantics of
> > > "all"
> > > > and document it. The example that we gave above, demonstrate that
> > > > "acks=all" itself may still be useful even with the introduction of
> > > > "quorum" since that scenario can be avoided by acks=all, but not
> > > > acks=quorum as it requires ALL ISR replicas to replicate even if that
> > > > number is larger than {min.isr} and also larger than the majority
> > number
> > > > (and if A is trying to shrink its ISR from {A,B,C} to {A,B} it will
> > fail
> > > > the ZK write since epoch has been incremented). Hence my proposal is
> to
> > > add
> > > > a new config than replacing current semantics of "all".
> > > >
> > > >
> > > > Guozhang
> > > >
> > > >
> > > > On Sat, Feb 3, 2018 at 2:45 AM, tao xiao <xiaotao183@gmail.com>
> wrote:
> > > >
> > > > > Hi Guozhang,
> > > > >
> > > > > Are you proposing changing the semantics of acks=all to acknowledge a
> message
> > > > only
> > > > > after all replicas (not all ISRs, which is what Kafka currently is
> > > doing)
> > > > > have committed the message? This is equivalent to setting
> > > min.isr=number
> > > > of
> > > > > replicas, which makes ack=all much stricter than what Kafka has
> right
> > > > now.
> > > > > I think this may introduce surprise to users too as producer will
> not
> > > > > succeed in producing a message to Kafka when one of the followers
> is
> > > down
> > > > >
> > > > > On Sat, 3 Feb 2018 at 15:26 Guozhang Wang <wangguoz@gmail.com>
> > wrote:
> > > > >
> > > > > > Hi Dong,
> > > > > >
> > > > > > Could you elaborate a bit more how controller could affect
> leaders
> > to
> > > > > > switch between all and quorum?
> > > > > >
> > > > > >
> > > > > > Guozhang
> > > > > >
> > > > > >
> > > > > > On Fri, Feb 2, 2018 at 10:12 PM, Dong Lin <lindong28@gmail.com>
> > > wrote:
> > > > > >
> > > > > > > Hey Guazhang,
> > > > > > >
> > > > > > > Got it. Thanks for the detailed explanation. I guess my point
> is
> > > that
> > > > > we
> > > > > > > can probably achieve the best of both worlds, i.e. maintain the
> > > > > existing
> > > > > > > behavior of ack="all" while improving the tail latency.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Dong
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Feb 2, 2018 at 8:43 PM, Guozhang Wang <
> > wangguoz@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > >> Hi Dong,
> > > > > > >>
> > > > > > >> Yes, in terms of fault tolerance "quorum" does not do better
> > than
> > > > > "all",
> > > > > > >> as I said, with {min.isr} set to X+1 Kafka is able to tolerate X
> > > > failures
> > > > > > only.
> > > > > > >> So if A and B are partitioned off at the same time, then there
> > are
> > > > two
> > > > > > >> concurrent failures and we do not guarantee all acked messages
> > > will
> > > > be
> > > > > > >> retained.
> > > > > > >>
> > > > > > >> The goal of my approach is to maintain the behavior of
> > ack="all",
> > > > > which
> > > > > > >> happen to do better than what Kafka is actually guaranteed:
> when
> > > > both
> > > > > A
> > > > > > and
> > > > > > >> B are partitioned off, produced records will not be acked
> since
> > > > "all"
> > > > > > >> requires all replicas (not only ISRs, my previous email has an
> > > > > incorrect
> > > > > > >> term) to ack. This is doing better than tolerating X
> > > failures,
> > > > > > which
> > > > > > >> I was proposing to keep, so that we would not introduce any
> > > > regression
> > > > > > >> "surprises" to users who are already using "all". In other
> > words,
> > > > > > "quorum"
> > > > > > >> is trading a bit of failure tolerance that is strictly defined
> > on
> > > > > > min.isr
> > > > > > >> for better tail latency.
> > > > > > >>
> > > > > > >>
> > > > > > >> Guozhang
> > > > > > >>
> > > > > > >>
> > > > > > >> On Fri, Feb 2, 2018 at 6:25 PM, Dong Lin <lindong28@gmail.com
> >
> > > > wrote:
> > > > > > >>
> > > > > > >>> Hey Guozhang,
> > > > > > >>>
> > > > > > >>> According to the new proposal, with 3 replicas, min.isr=2 and
> > > > > > >>> acks="quorum", it seems that acknowledged messages can still
> be
> > > > > > truncated
> > > > > > >>> in the network partition scenario you mentioned, right? So I
> > > guess
> > > > > the
> > > > > > goal
> > > > > > >>> is for some user to achieve better tail latency at the cost
> of
> > > > > > potential
> > > > > > >>> message loss?
> > > > > > >>>
> > > > > > >>> If this is the case, then I think it may be better to adopt
> an
> > > > > approach
> > > > > > >>> where controller dynamically turn on/off this optimization.
> > This
> > > > > > provides
> > > > > > >>> user with peace of mind (i.e. no message loss) while still
> > > reducing
> > > > > > tail
> > > > > > >>> latency. What do you think?
> > > > > > >>>
> > > > > > >>> Thanks,
> > > > > > >>> Dong
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> On Fri, Feb 2, 2018 at 11:11 AM, Guozhang Wang <
> > > wangguoz@gmail.com
> > > > >
> > > > > > >>> wrote:
> > > > > > >>>
> > > > > > >>>> Hello Litao,
> > > > > > >>>>
> > > > > > >>>> Just double checking on the leader election details, do you
> > have
> > > > > time
> > > > > > >>>> to complete the proposal on that part?
> > > > > > >>>>
> > > > > > >>>> Also Jun mentioned one caveat related to KIP-250 on the
> > KIP-232
> > > > > > >>>> discussion thread that Dong is working on, I figured it is
> > worth
> > > > > > pointing
> > > > > > >>>> out here with a tentative solution:
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>> ```
> > > > > > >>>> Currently, if the producer uses acks=-1, a write will only
> > > succeed
> > > > > if
> > > > > > >>>> the write is received by all in-sync replicas (i.e.,
> > committed).
> > > > > This
> > > > > > >>>> is true even when min.isr is set since we first wait for a
> > > message
> > > > > to
> > > > > > >>>> be committed and then check the min.isr requirement. KIP-250
> > may
> > > > > > >>>> change that, but we can discuss the implication there.
> > > > > > >>>> ```
> > > > > > >>>>
> > > > > > >>>> The caveat is that, if we change the acking semantics in
> > KIP-250
> > > > > that
> > > > > > >>>> we will only require {min.isr} replicas to acknowledge a
> > produce,
> > > > > then
> > > > > > the
> > > > > > >>>> above scenario will have a caveat: imagine you have {A, B,
> C}
> > > > > > replicas of a
> > > > > > >>>> partition with A as the leader, all in the isr list, and
> > min.isr
> > > > is
> > > > > 2.
> > > > > > >>>>
> > > > > > >>>> 1. Say there is a network partition and both A and B are
> > fenced
> > > > > off. C
> > > > > > >>>> is elected as the new leader, it shrinks its isr list to
> only
> > > {C};
> > > > > > from A's
> > > > > > >>>> point of view it does not know it becomes the "ghost" and no
> > > > longer
> > > > > > the
> > > > > > >>>> leader, all it does is shrinking the isr list to {A, B}.
> > > > > > >>>>
> > > > > > >>>> 2. At this time, any new writes with ack=-1 to C will not be
> > > > acked,
> > > > > > >>>> since from C's pov there is only one replica. This is
> correct.
> > > > > > >>>>
> > > > > > >>>> 3. However, any writes that are sent to A (NOTE this is
> > totally
> > > > > > >>>> possible, since producers would only refresh metadata
> > > > periodically,
> > > > > > >>>> additionally if they happen to ask A or B they will get the
> > > stale
> > > > > > metadata
> > > > > > >>>> that A's still the leader), since A thinks that isr list is
> > {A,
> > > B}
> > > > > > and as
> > > > > > >>>> long as B has replicated the message, A can ack the
> produce.
> > > > > > >>>>
> > > > > > >>>>     This is not correct behavior, since when the network heals,
> A
> > > > would
> > > > > > >>>> realize it is not the leader and will truncate its log. And
> > > hence
> > > > > as a
> > > > > > >>>> result the acked records are lost, violating Kafka's
> > guarantees.
> > > > And
> > > > > > >>>> KIP-232 would not help prevent this scenario.
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>> Although one can argue that, with 3 replicas and min.isr set
> > to
> > > 2,
> > > > > > >>>> Kafka is guaranteeing to tolerate only one failure, while
> the
> > > > above
> > > > > > >>>> scenario is actually two concurrent failures (both A and B
> are
> > > > > > considered
> > > > > > >>>> wedged), this is still a regression to the current version.
> > > > > > >>>>
> > > > > > >>>> So to resolve this issue, I'd propose we can change the
> > > semantics
> > > > in
> > > > > > >>>> the following way (this is only slightly different from your
> > > > > > proposal):
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>> 1. Add one more value to client-side acks config:
> > > > > > >>>>
> > > > > > >>>>    0: no acks needed at all.
> > > > > > >>>>    1: ack from the leader.
> > > > > > >>>>    all: ack from ALL the ISR replicas AND that current
> number
> > of
> > > > isr
> > > > > > >>>> replicas has to be no smaller than {min.isr} (i.e. not
> > changing
> > > > this
> > > > > > >>>> semantic).
> > > > > > >>>>    quorum: this is the new value, it requires ack from
> enough
> > > > number
> > > > > > of
> > > > > > >>>> ISR replicas no smaller than majority of the replicas AND no
> > > > smaller
> > > > > > than
> > > > > > >>>> {min.isr}.
> > > > > > >>>>
> > > > > > >>>> 2. Clarify in the docs that if a user wants to tolerate X
> > > > failures,
> > > > > > she
> > > > > > >>>> needs to set client acks=all or acks=quorum (better tail
> > latency
> > > > > than
> > > > > > >>>> "all") with broker {min.sir} to be X+1; however, "all" is
> not
> > > > > > necessarily
> > > > > > >>>> stronger than "quorum":
> > > > > > >>>>
> > > > > > >>>> For example, with 3 replicas, and {min.isr} set to 2. Here
> is
> > a
> > > > list
> > > > > > of
> > > > > > >>>> scenarios:
> > > > > > >>>>
> > > > > > >>>> a. ISR list has 3: "all" waits for all 3, "quorum" waits
> for 2
> > > of
> > > > > > them.
> > > > > > >>>> b. ISR list has 2: "all" and "quorum" waits for both 2 of
> > them.
> > > > > > >>>> c. ISR list has 1: "all" and "quorum" would not ack.
> > > > > > >>>>
> > > > > > >>>> If {min.isr} is set to 1, interestingly, here would be the
> > list
> > > of
> > > > > > >>>> scenarios:
> > > > > > >>>>
> > > > > > >>>> a. ISR list has 3: "all" waits for all 3, "quorum" waits
> for 2
> > > of
> > > > > > them.
> > > > > > >>>> b. ISR list has 2: "all" and "quorum" waits for both 2 of
> > them.
> > > > > > >>>> c. ISR list has 1: "all" waits for leader to return, while
> > > > "quorum"
> > > > > > >>>> would not ack (because it requires that number >= {min.isr},
> > AND
> > > >=
> > > > > > >>>> {majority of num.replicas}, so it's actually stronger than
> > > "all").
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>> WDYT?
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>> Guozhang
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>> On Thu, Jan 25, 2018 at 8:13 PM, Dong Lin <
> > lindong28@gmail.com>
> > > > > > wrote:
> > > > > > >>>>
> > > > > > >>>>> Hey Litao,
> > > > > > >>>>>
> > > > > > >>>>> Not sure there will be an easy way to select the broker
> with
> > > > > highest
> > > > > > >>>>> LEO
> > > > > > >>>>> without losing acknowledged message. In case it is useful,
> > here
> > > > is
> > > > > > >>>>> another
> > > > > > >>>>> idea. Maybe we can have a mechanism to switch between
> > the
> > > > > > min.isr
> > > > > > >>>>> and
> > > > > > >>>>> isr set for determining when to acknowledge a message.
> > > Controller
> > > > > can
> > > > > > >>>>> probably use RPC to request the current leader to use isr
> set
> > > > > before
> > > > > > it
> > > > > > >>>>> sends LeaderAndIsrRequest for leadership change.
> > > > > > >>>>>
> > > > > > >>>>> Regards,
> > > > > > >>>>> Dong
> > > > > > >>>>>
> > > > > > >>>>>
> > > > > > >>>>> On Wed, Jan 24, 2018 at 7:29 PM, Litao Deng
> > > > > > >>>>> <litao.deng@airbnb.com.invalid>
> > > > > > >>>>> wrote:
> > > > > > >>>>>
> > > > > > >>>>> > Thanks Jun for the detailed feedback.
> > > > > > >>>>> >
> > > > > > >>>>> > Yes, for #1, I mean the live replicas from the ISR.
> > > > > > >>>>> >
> > > > > > >>>>> > Actually, I believe for all of the 4 new leader election
> > > > > strategies
> > > > > > >>>>> > (offline, reassign, preferred replica and controlled
> > > shutdown),
> > > > > we
> > > > > > >>>>> need to
> > > > > > >>>>> > make corresponding changes. Will document the details in
> > the
> > > > KIP.
> > > > > > >>>>> >
> > > > > > >>>>> > On Wed, Jan 24, 2018 at 3:59 PM, Jun Rao <
> jun@confluent.io
> > >
> > > > > wrote:
> > > > > > >>>>> >
> > > > > > >>>>> > > Hi, Litao,
> > > > > > >>>>> > >
> > > > > > >>>>> > > Thanks for the KIP. Good proposal. A few comments
> below.
> > > > > > >>>>> > >
> > > > > > >>>>> > > 1. The KIP says "select the live replica with the
> largest
> > > > LEO".
> > > > > > I
> > > > > > >>>>> guess
> > > > > > >>>>> > > what you meant is selecting the live replicas in ISR
> with
> > > the
> > > > > > >>>>> largest
> > > > > > >>>>> > LEO?
> > > > > > >>>>> > >
> > > > > > >>>>> > > 2. I agree that we can probably just reuse the current
> > > > min.isr
> > > > > > >>>>> > > configuration, but with a slightly different semantics.
> > > > > > Currently,
> > > > > > >>>>> if
> > > > > > >>>>> > > min.isr is set, a user expects the record to be in at
> > least
> > > > > > min.isr
> > > > > > >>>>> > > replicas on successful ack. This KIP guarantees this
> too.
> > > > Most
> > > > > > >>>>> people are
> > > > > > >>>>> > > probably surprised that currently the ack is only sent
> > back
> > > > > after
> > > > > > >>>>> all
> > > > > > >>>>> > > replicas in ISR receive the record. This KIP will
> change
> > > the
> > > > > ack
> > > > > > >>>>> to only
> > > > > > >>>>> > > wait on min.isr replicas, which matches the user's
> > > > expectation
> > > > > > and
> > > > > > >>>>> gives
> > > > > > >>>>> > > better latency. Currently, we guarantee no data loss if
> > > there
> > > > > are
> > > > > > >>>>> fewer
> > > > > > >>>>> > > than replication factor failures. The KIP changes that
> to
> > > > fewer
> > > > > > >>>>> than
> > > > > > >>>>> > > min.isr failures. The latter probably matches the user
> > > > > > expectation.
> > > > > > >>>>> > >
> > > > > > >>>>> > > 3. I agree that the new leader election process is a
> bit
> > > more
> > > > > > >>>>> > complicated.
> > > > > > >>>>> > > The controller now needs to contact all replicas in ISR
> > to
> > > > > > >>>>> determine who
> > > > > > >>>>> > > has the longest log. However, this happens
> infrequently.
> > > So,
> > > > > it's
> > > > > > >>>>> > probably
> > > > > > >>>>> > > worth doing for the better latency in #2.
> > > > > > >>>>> > >
> > > > > > >>>>> > > 4. We have to think through the preferred leader
> election
> > > > > > process.
> > > > > > >>>>> > > Currently, the first assigned replica is preferred for
> > load
> > > > > > >>>>> balancing.
> > > > > > >>>>> > > There is a process to automatically move the leader to
> > the
> > > > > > >>>>> preferred
> > > > > > >>>>> > > replica when it's in sync. The issue is that the
> > preferred
> > > > > > replica
> > > > > > >>>>> may not
> > > > > > >>>>> > > be the replica with the longest log. Naively switching
> to
> > > the
> > > > > > >>>>> preferred
> > > > > > >>>>> > > replica may cause data loss when there are actually
> fewer
> > > > > > failures
> > > > > > >>>>> than
> > > > > > >>>>> > > configured min.isr. One way to address this issue is to
> > do
> > > > the
> > > > > > >>>>> following
> > > > > > >>>>> > > steps during preferred leader election: (a) controller
> > > sends
> > > > an
> > > > > > RPC
> > > > > > >>>>> > request
> > > > > > >>>>> > > to the current leader; (b) the current leader stops
> > taking
> > > > new
> > > > > > >>>>> writes
> > > > > > >>>>> > > (sending a new error code to the clients) and returns
> its
> > > LEO
> > > > > > >>>>> (call it L)
> > > > > > >>>>> > > to the controller; (c) the controller issues an RPC
> > request
> > > > to
> > > > > > the
> > > > > > >>>>> > > preferred replica and waits for its LEO to reach L; (d) the
> > > > > > controller
> > > > > > >>>>> > changes
> > > > > > >>>>> > > the leader to the preferred replica.
> > > > > > >>>>> > >
> > > > > > >>>>> > > Jun
> > > > > > >>>>> > >
> > > > > > >>>>> > > On Wed, Jan 24, 2018 at 2:51 PM, Litao Deng
> > > > > > >>>>> > <litao.deng@airbnb.com.invalid
> > > > > > >>>>> > > >
> > > > > > >>>>> > > wrote:
> > > > > > >>>>> > >
> > > > > > >>>>> > > > Sorry folks, just realized I didn't use the correct
> > > thread
> > > > > > >>>>> format for
> > > > > > >>>>> > the
> > > > > > >>>>> > > > discussion. I started this new one and copied all of
> > the
> > > > > > >>>>> responses from
> > > > > > >>>>> > > the
> > > > > > >>>>> > > > old one.
> > > > > > >>>>> > > >
> > > > > > >>>>> > > > @Dong
> > > > > > >>>>> > > > It makes sense to just use the min.insync.replicas
> > > instead
> > > > of
> > > > > > >>>>> > > introducing a
> > > > > > >>>>> > > > new config, and we must make this change together
> with
> > > the
> > > > > > >>>>> LEO-based
> > > > > > >>>>> > new
> > > > > > >>>>> > > > leader election.
> > > > > > >>>>> > > >
> > > > > > >>>>> > > > @Xi
> > > > > > >>>>> > > > I thought about embedding the LEO information to the
> > > > > > >>>>> ControllerContext,
> > > > > > >>>>> > > > didn't find a way. Using RPC will make the leader
> > > election
> > > > > > period
> > > > > > >>>>> > longer
> > > > > > >>>>> > > > and this should happen in very rare cases (broker
> > > failure,
> > > > > > >>>>> controlled
> > > > > > >>>>> > > > shutdown, preferred leader election and partition
> > > > > > reassignment).
> > > > > > >>>>> > > >
> > > > > > >>>>> > > > @Jeff
> > > > > > >>>>> > > > The current leader election is to pick the first
> > replica
> > > > from
> > > > > > AR
> > > > > > >>>>> which
> > > > > > >>>>> > > > exists both in the live brokers and ISR sets. I agree
> > > with
> > > > > you
> > > > > > >>>>> about
> > > > > > >>>>> > > > changing the current/default behavior will cause many
> > > > > > >>>>> confusions, and
> > > > > > >>>>> > > > that's the reason the title is "Add Support ...". In
> > this
> > > > > case,
> > > > > > >>>>> we
> > > > > > >>>>> > > wouldn't
> > > > > > >>>>> > > > break any current promises and provide a separate
> > option
> > > > for
> > > > > > our
> > > > > > >>>>> user.
> > > > > > >>>>> > > > In terms of KIP-250, I feel it is more like the
> > > > > > "Semisynchronous
> > > > > > >>>>> > > > Replication" in the MySQL world, and yes it is
> > something
> > > > > > between
> > > > > > >>>>> acks=1
> > > > > > >>>>> > > and
> > > > > > >>>>> > > > acks=insync.replicas. Additionally, I feel KIP-250
> and
> > > > > KIP-227
> > > > > > >>>>> are
> > > > > > >>>>> > > > two orthogonal improvements. KIP-227 is to improve
> the
> > > > > > >>>>> replication
> > > > > > >>>>> > > protocol
> > > > > > >>>>> > > > (like the introduction of parallel replication in
> > MySQL),
> > > > and
> > > > > > >>>>> KIP-250
> > > > > > >>>>> > is
> > > > > > >>>>> > > an
> > > > > > >>>>> > > > enhancement for the replication architecture (sync,
> > > > > semi-sync,
> > > > > > >>>>> and
> > > > > > >>>>> > > async).
> > > > > > >>>>> > > >
> > > > > > >>>>> > > >
> > > > > > >>>>> > > > Dong Lin
> > > > > > >>>>> > > >
> > > > > > >>>>> > > > > Thanks for the KIP. I have one quick comment before
> > you
> > > > > > >>>>> provide more
> > > > > > >>>>> > > > detail
> > > > > > >>>>> > > > > on how to select the leader with the largest LEO.
> > > > > > >>>>> > > > > Do you think it would make sense to change the
> > default
> > > > > > >>>>> behavior of
> > > > > > >>>>> > > > acks=-1,
> > > > > > >>>>> > > > > such that broker will acknowledge the message once
> > the
> > > > > > message
> > > > > > >>>>> has
> > > > > > >>>>> > been
> > > > > > >>>>> > > > > replicated to min.insync.replicas brokers? This
> would
> > > > allow
> > > > > > us
> > > > > > >>>>> to
> > > > > > >>>>> > keep
> > > > > > >>>>> > > > the
> > > > > > >>>>> > > > > same durability guarantee, improve produce request
> > > > latency
> > > > > > >>>>> without
> > > > > > >>>>> > > > having a
> > > > > > >>>>> > > > > new config.
> > > > > > >>>>> > > >
> > > > > > >>>>> > > >
> > > > > > >>>>> > > > Hu Xi
> > > > > > >>>>> > > >
> > > > > > >>>>> > > > > Currently, by holding the assigned replicas (AR)
> > for
> > > > all
> > > > > > >>>>> > partitions,
> > > > > > >>>>> > > > > controller is now able to elect new leaders by
> > > selecting
> > > > > the
> > > > > > >>>>> first
> > > > > > >>>>> > > > replica
> > > > > > >>>>> > > > > of AR which occurs in both live replica set and
> ISR.
> > If
> > > > > > >>>>> switching to
> > > > > > >>>>> > > the
> > > > > > >>>>> > > > > LEO-based strategy, controller context might need
> to
> > be
> > > > > > >>>>> enriched or
> > > > > > >>>>> > > > > augmented to store those values.  If retrieving
> those
> > > > LEOs
> > > > > > >>>>> real-time,
> > > > > > >>>>> > > > > several rounds of RPCs are unavoidable which seems
> to
> > > > > violate
> > > > > > >>>>> the
> > > > > > >>>>> > > > original
> > > > > > >>>>> > > > > intention of this KIP.
> > > > > > >>>>> > > >
> > > > > > >>>>> > > >

Jeff Widman

> I agree with Dong, we should see if it's possible to change the
> default behavior so that as soon as min.insync.replicas brokers
> respond, the broker acknowledges the message back to the client
> without waiting for additional brokers who are in the in-sync replica
> list to respond. (I actually thought it already worked this way.)
>
> As you implied in the KIP though, changing this default introduces a
> weird state where an in-sync follower broker is not guaranteed to have
> a message... So at a minimum, the leadership failover algorithm would
> need to be sure to pick the most up-to-date follower... I thought it
> already did this? But if multiple brokers fail in quick succession,
> then a broker that was in the ISR could become a leader without ever
> receiving the message... violating the current promises of
> unclean.leader.election.enable=False... so changing the default might
> not be a tenable solution.
>
> What also jumped out at me in the KIP was the goal of reducing p999
> when setting replica lag time at 10 seconds(!!)... I understand the
> desire to minimize frequent ISR shrink/expansion, as I face this same
> issue at my day job. But what you're essentially trying to do here is
> create an additional replication state that is in-between acks=1 and
> acks=ISR to paper over a root problem of ISR shrink/expansion...
>
> I'm just wary of shipping more features (and more operational
> confusion) if it's only addressing the symptom rather than the root
> cause. For example, my day job's problem is that we run a very high
> number of low-traffic partitions per broker, so the fetch requests hit
> many partitions before they fill. Solving that requires changing our
> architecture + making the replication protocol more efficient
> (KIP-227).
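The tail-latency effect being debated here, waiting only for the fastest Q replicas rather than for every ISR member, can be shown with a toy simulation (the latency distribution and the 1% slow-replica probability are invented for illustration; this is not Kafka code):

```python
# Toy model: with acks=all the producer waits for the slowest in-sync
# replica, while a quorum of Q only waits for the Q-th fastest. One
# replica is occasionally slow (e.g. a GC pause), which dominates the
# p999 of the full-ISR wait but rarely affects the quorum wait.
import random

random.seed(0)
RF, Q = 3, 2  # replication factor, quorum size

def replica_ack_times():
    # 1% chance a replica is slow (mean 5), otherwise fast (mean 0.5).
    return sorted(random.expovariate(1 / 5) if random.random() < 0.01
                  else random.expovariate(1 / 0.5) for _ in range(RF))

acks_all, acks_quorum = [], []
for _ in range(10000):
    times = replica_ack_times()
    acks_all.append(times[-1])        # wait for every replica
    acks_quorum.append(times[Q - 1])  # wait for the Q fastest

def p999(xs):
    return sorted(xs)[int(len(xs) * 0.999)]

# The full-ISR p999 is dominated by the occasional slow replica.
print(p999(acks_all) > p999(acks_quorum))  # True
```

Note this only models produce-latency, not the durability question Jeff raises; by construction the quorum wait is never longer than the full wait, which is exactly why the KIP reports a large P999 improvement.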

On Tue, Jan 23, 2018 at 10:02 PM, Litao Deng <litao.deng@airbnb.com>
wrote:

> Hey folks. I would like to add a feature to support quorum-based
> acknowledgment for the producer request. We have been running a
> modified version of Kafka on our testing cluster for weeks, and the
> improvement of P999 is significant, with very stable latency.
> Additionally, I have a proposal to achieve similar data durability as
> with the insync.replicas-based acknowledgment through LEO-based leader
> election.
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-250+Add+Support+for+Quorum-based+Producer+Acknowledge
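The durability side of the proposal, that a record acknowledged by Q replicas survives any Q - 1 replica failures when the longest-log live replica becomes leader, can be checked exhaustively on a toy state (broker ids and offsets are invented for illustration):

```python
# Toy check of the durability argument: an offset known to be on at
# least Q replicas survives any Q-1 failures, because some surviving
# replica still holds it and longest-log election picks a replica at
# least that far ahead. One more failure can lose it.
from itertools import combinations

RF, Q = 4, 2
# LEOs after the producer got an ack for offset 100 from the Q fastest.
leos = {1: 120, 2: 100, 3: 98, 4: 95}
acked = 100  # highest offset known to be on at least Q replicas

def survives_all_failures(leos, acked, failures):
    """True iff every choice of `failures` dead brokers leaves a
    longest-log leader whose log still contains the acked offset."""
    for dead in combinations(leos, failures):
        alive = {b: leo for b, leo in leos.items() if b not in dead}
        if max(alive.values()) < acked:
            return False  # acked record lost
    return True

print(survives_all_failures(leos, acked, Q - 1))  # True
print(survives_all_failures(leos, acked, Q))      # False
```

This mirrors the guarantee sketched earlier in the thread: fewer than Q replica failures cause no loss of acknowledged data, while losing all Q holders of a record (brokers 1 and 2 here) can.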

--
-- Guozhang
