kafka-dev mailing list archives

From Dong Lin <lindon...@gmail.com>
Subject Re: [DISCUSS] KIP-112: Handle disk failure for JBOD
Date Tue, 28 Feb 2017 22:41:37 GMT
Hey Jun,

I just realized that StopReplicaRequest itself doesn't specify the
replicaId in the wire protocol, so the controller would need to log the
brokerId alongside each StopReplicaRequest. It may therefore be
reasonable for the controller to do the same with LeaderAndIsrRequest and only
specify isNewReplica for the broker that receives the LeaderAndIsrRequest.
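
To make the convention concrete, here is a minimal, hypothetical sketch (not
actual Kafka controller code; the types and helper names are made up for this
discussion) of a controller that logs the destination brokerId next to each
LeaderAndIsrRequest it sends, so a per-broker isNewReplica value can still be
interpreted when reading the controller log:

import java.util.List;
import java.util.Map;

public class LeaderAndIsrDispatchSketch {

    // Hypothetical, simplified per-partition payload for illustration only.
    record PartitionState(String topic, int partition, int leader,
                          List<Integer> isr, boolean isNewReplica) {}

    // statesByBroker: destination brokerId -> partition states as seen by that
    // broker, with isNewReplica already evaluated from that broker's point of view.
    static void dispatch(Map<Integer, List<PartitionState>> statesByBroker) {
        for (Map.Entry<Integer, List<PartitionState>> entry : statesByBroker.entrySet()) {
            int destinationBrokerId = entry.getKey();
            // Log the destination brokerId with the request, mirroring what the
            // controller already has to do for StopReplicaRequest.
            System.out.printf("Sending LeaderAndIsrRequest to broker %d: %s%n",
                    destinationBrokerId, entry.getValue());
        }
    }
}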

Thanks,
Dong

On Tue, Feb 28, 2017 at 2:14 PM, Dong Lin <lindong28@gmail.com> wrote:

> Hi Jun,
>
> Yeah, there is a tradeoff between the controller's implementation complexity
> and the wire-protocol complexity. I personally think it is more important to
> keep the wire protocol concise and only add information to it if necessary.
> It seems fine to add a little bit of complexity to the controller's
> implementation, e.g. logging the destination broker per LeaderAndIsrRequest.
> Becket also shares this opinion with me. Is the only purpose of doing so to
> make the controller log simpler?
>
> And certainly, I have added Todd's comments to the wiki.
>
> Thanks,
> Dong
>
>
> On Tue, Feb 28, 2017 at 1:37 PM, Jun Rao <jun@confluent.io> wrote:
>
>> Hi, Dong,
>>
>> 52. What you suggested would work. However, I am thinking that it's
>> probably simpler to just set isNewReplica at the replica level. That way,
>> the LeaderAndIsrRequest can be constructed a bit more simply. When reading a
>> LeaderAndIsrRequest in the controller log, it's easier to see which
>> replicas are new without looking at which broker the request is intended
>> for.
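>>
>> For illustration, here is a rough sketch of what a replica-level flag could
>> look like (hypothetical types for this discussion, not the real wire format):
>> each entry in the replicas list carries its own isNewReplica, so the same
>> request body works for every destination broker.
>>
>> import java.util.List;
>>
>> public class ReplicaLevelFlagSketch {
>>     record ReplicaState(int brokerId, boolean isNewReplica) {}
>>     record PartitionState(String topic, int partition, int leader,
>>                           List<Integer> isr, List<ReplicaState> replicas) {}
>>
>>     public static void main(String[] args) {
>>         // Broker 3 hosts the newly assigned replica; brokers 1 and 2 do not.
>>         PartitionState state = new PartitionState(
>>                 "orders", 0, 1, List.of(1, 2),
>>                 List.of(new ReplicaState(1, false),
>>                         new ReplicaState(2, false),
>>                         new ReplicaState(3, true)));
>>         System.out.println(state);
>>     }
>> }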
>>
>> Could you also add those additional points from Todd on one broker per disk
>> vs. JBOD vs. RAID5/6 to the KIP?
>>
>> Thanks,
>>
>> Hi, Todd,
>>
>> Thanks for the feedback. That's very useful.
>>
>> Jun
>>
>> On Tue, Feb 28, 2017 at 10:25 AM, Dong Lin <lindong28@gmail.com> wrote:
>>
>> > Hey Jun,
>> >
>> > Certainly, I have asked Todd to reply to the thread. And I have updated
>> > the item in the wiki.
>> >
>> > 50. The full statement is "Broker assumes a log directory to be good after
>> > it starts, and marks a log directory as bad once there is an IOException
>> > when the broker attempts to access (i.e. read or write) the log directory".
>> > This statement seems reasonable, right? If a log directory is actually bad,
>> > then the broker will first assume it is OK, try to read logs on this log
>> > directory, encounter an IOException, and then mark it as bad.
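>> >
>> > As a minimal sketch of that rule (hypothetical names, not the actual broker
>> > code): a log directory starts out assumed good and is marked bad the first
>> > time an IOException is hit while accessing it.
>> >
>> > import java.io.IOException;
>> > import java.nio.file.Files;
>> > import java.nio.file.Path;
>> > import java.util.Map;
>> > import java.util.concurrent.ConcurrentHashMap;
>> >
>> > public class LogDirHealthSketch {
>> >     private final Map<Path, Boolean> online = new ConcurrentHashMap<>();
>> >
>> >     byte[] read(Path logDir, Path segment) {
>> >         online.putIfAbsent(logDir, true);  // assumed good until proven otherwise
>> >         try {
>> >             return Files.readAllBytes(logDir.resolve(segment));
>> >         } catch (IOException e) {
>> >             online.put(logDir, false);     // first I/O error marks the whole dir bad
>> >             throw new RuntimeException("Log directory " + logDir + " marked offline", e);
>> >         }
>> >     }
>> >
>> >     boolean isOnline(Path logDir) {
>> >         return online.getOrDefault(logDir, true);
>> >     }
>> > }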
>> >
>> > 51. My bad. I thought I removed it but I didn't. It is removed now.
>> >
>> > 52. I don't think so. The isNewReplica field in the LeaderAndIsrRequest is
>> > only relevant to the replica (i.e. broker) that receives the
>> > LeaderAndIsrRequest. There is no need to specify whether each replica is
>> > new inside the LeaderAndIsrRequest. In other words, if the controller sends
>> > LeaderAndIsrRequests to three different replicas of a given partition, the
>> > isNewReplica field can be different across these three requests.
>> >
>> > Yeah, I would definitely want to start the discussion on KIP-113 after we
>> > have reached agreement on KIP-112. I actually opened the KIP-113 discussion
>> > thread on 1/12 together with this thread. I have yet to add to KIP-113 the
>> > ability to list offline directories, which we discussed in this thread.
>> >
>> > Thanks for all your reviews! Is there any further concern with the latest
>> > KIP?
>> >
>> > Thanks!
>> > Dong
>> >
>> > On Tue, Feb 28, 2017 at 9:40 AM, Jun Rao <jun@confluent.io> wrote:
>> >
>> > > Hi, Dong,
>> > >
>> > > RAID6 is an improvement over RAID5 and can tolerate 2 disk failures.
>> > > Eno's point is that the rebuild of RAID5/RAID6 requires reading more data
>> > > compared with RAID10, which increases the probability of error during the
>> > > rebuild. This makes sense. In any case, do you think you could ask the
>> > > SREs at LinkedIn to share their opinions on RAID5/RAID6?
>> > >
>> > > Yes, when a replica is offline due to a bad disk, it makes sense to
>> > > handle it immediately as if a StopReplicaRequest had been received (i.e.,
>> > > the replica is no longer considered a leader and is removed from any
>> > > replica fetcher thread). Could you add that detail to item 2 in the wiki?
>> > >
>> > > 50. The wiki says "Broker assumes a log directory to be good after it
>> > > starts": a log directory could actually be bad during startup.
>> > >
>> > > 51. In item 4, the wiki says "The controller watches the path
>> > > /log_dir_event_notification for new znode.". This doesn't seem to be
>> > > needed now?
>> > >
>> > > 52. The isNewReplica field in LeaderAndIsrRequest should be for each
>> > > replica inside the replicas field, right?
>> > >
>> > > Other than those, the current KIP looks good to me. Do you want to start
>> > > a separate discussion thread on KIP-113? I do have some comments there.
>> > >
>> > > Thanks for working on this!
>> > >
>> > > Jun
>> > >
>> > >
>> > > On Mon, Feb 27, 2017 at 5:51 PM, Dong Lin <lindong28@gmail.com> wrote:
>> > >
>> > > > Hi Jun,
>> > > >
>> > > > In addition to Eno's reference explaining why rebuild time with RAID-5
>> > > > is more expensive, another concern is that RAID-5 will fail if more than
>> > > > one disk fails. JBOD still works with one or more disk failures and has
>> > > > better performance with one disk failure. These seem like good arguments
>> > > > for using JBOD instead of RAID-5.
>> > > >
>> > > > If a leader replica goes offline, the broker should first take all
>> > > > actions (i.e. remove the partition from the fetcher thread) as if it had
>> > > > received a StopReplicaRequest for this partition, because the replica can
>> > > > no longer work anyway. It will also respond with an error to any
>> > > > ProduceRequest and FetchRequest for the partition. The broker notifies
>> > > > the controller by writing a notification znode in ZK. The controller
>> > > > learns of the disk failure event from ZK, sends a LeaderAndIsrRequest and
>> > > > learns from the LeaderAndIsrResponse that the replica is offline. The
>> > > > controller will then elect a new leader for this partition and send
>> > > > LeaderAndIsrRequest/UpdateMetadataRequest to the relevant brokers. The
>> > > > broker should stop adjusting the ISR for this partition as if the broker
>> > > > were already offline. I am not sure there is any inconsistency in the
>> > > > broker's behavior whether it is a leader or a follower. Is there any
>> > > > concern with this approach?
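>> > > >
>> > > > A condensed sketch of that sequence from the broker's side (the helper
>> > > > names here are hypothetical, not the actual broker code):
>> > > >
>> > > > public class OfflineLeaderHandlingSketch {
>> > > >     interface Broker {
>> > > >         void removeFromFetcherThreads(String topicPartition); // stop replicating it
>> > > >         void markReplicaOffline(String topicPartition);       // Produce/Fetch now get an error
>> > > >         void writeLogDirEventNotification();                  // znode that wakes the controller
>> > > >     }
>> > > >
>> > > >     static void onLogDirFailure(Broker broker, Iterable<String> affectedPartitions) {
>> > > >         for (String tp : affectedPartitions) {
>> > > >             // Same local actions as if a StopReplicaRequest had been received.
>> > > >             broker.removeFromFetcherThreads(tp);
>> > > >             broker.markReplicaOffline(tp);
>> > > >         }
>> > > >         // The controller reads the notification, sends LeaderAndIsrRequest, learns
>> > > >         // from the response which replicas are offline, elects new leaders, and
>> > > >         // propagates the change via LeaderAndIsrRequest/UpdateMetadataRequest.
>> > > >         broker.writeLogDirEventNotification();
>> > > >     }
>> > > > }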
>> > > >
>> > > > Thanks for catching this. I have removed that reference from the KIP.
>> > > >
>> > > > Hi Eno,
>> > > >
>> > > > Thank you for providing the reference on RAID-5. At LinkedIn we have 10
>> > > > disks per Kafka machine. It will not be a show-stopper operationally for
>> > > > LinkedIn if we have to deploy one-broker-per-disk. On the other hand, we
>> > > > previously discussed the advantages of JBOD vs. one-broker-per-disk and
>> > > > one-broker-per-machine. One-broker-per-disk suffers from the problems
>> > > > described in the KIP, and one-broker-per-machine increases the failure
>> > > > rate caused by disk failures by 10X. Since JBOD is strictly better than
>> > > > either of the two, it is also better than one-broker-per-multiple-disks,
>> > > > which is somewhere between one-broker-per-disk and one-broker-per-machine.
>> > > >
>> > > > I personally think the benefits of the JBOD design are worth the
>> > > > implementation complexity it introduces. I would also argue that it is
>> > > > reasonable for Kafka to manage this low-level detail because Kafka is
>> > > > already exposing and managing the replication factor of its data. But
>> > > > whether the complexity is worthwhile can be subjective and I cannot prove
>> > > > my opinion. I am contributing a significant amount of time to this KIP
>> > > > because Kafka developers at LinkedIn believe it is useful and worth the
>> > > > effort. Yeah, it will be useful to see what everyone else thinks about it.
>> > > >
>> > > >
>> > > > Thanks,
>> > > > Dong
>> > > >
>> > > >
>> > > > On Mon, Feb 27, 2017 at 1:16 PM, Jun Rao <jun@confluent.io> wrote:
>> > > >
>> > > > > Hi, Dong,
>> > > > >
>> > > > > For RAID5, I am not sure the rebuild cost is a big concern. If a disk
>> > > > > fails, typically an admin has to bring down the broker, replace the
>> > > > > failed disk with a new one, trigger the RAID rebuild, and bring up the
>> > > > > broker. This way, there is no performance impact at runtime due to
>> > > > > rebuild. The benefit is that a broker doesn't fail in a hard way when
>> > > > > there is a disk failure and can be brought down in a controlled way for
>> > > > > maintenance. While the broker is running with a failed disk, reads may
>> > > > > be more expensive since they have to be computed from the parity.
>> > > > > However, if most reads are from page cache, this may not be a big issue
>> > > > > either. So, it would be useful to do some tests on RAID5 before we
>> > > > > completely rule it out.
>> > > > >
>> > > > > Regarding whether to remove an offline replica from the fetcher thread
>> > > > > immediately. What do we do when a failed replica is a leader? Do we do
>> > > > > nothing or mark the replica as not the leader immediately? Intuitively,
>> > > > > it seems it's better if the broker acts consistently on a failed replica
>> > > > > whether it's a leader or a follower. For ISR churns, I was just pointing
>> > > > > out that if we don't send StopReplicaRequest to a broker to be shut down
>> > > > > in a controlled way, then the leader will shrink ISR, expand it and
>> > > > > shrink it again after the timeout.
>> > > > >
>> > > > > The KIP seems to still reference
>> > > > > "/broker/topics/[topic]/partitions/[partitionId]/controller_managed_state".
>> > > > >
>> > > > > Thanks,
>> > > > >
>> > > > > Jun
>> > > > >
>> > > > > On Sat, Feb 25, 2017 at 7:49 PM, Dong Lin <lindong28@gmail.com> wrote:
>> > > > >
>> > > > > > Hey Jun,
>> > > > > >
>> > > > > > Thanks for the suggestion. I think it is a good idea to not put the
>> > > > > > created flag in ZK and simply specify isNewReplica=true in the
>> > > > > > LeaderAndIsrRequest if the replica was in the NewReplica state. It
>> > > > > > will only fail the replica creation in the scenario where the
>> > > > > > controller fails after
>> > > > > > topic-creation/partition-reassignment/partition-number-change but
>> > > > > > before actually sending out the LeaderAndIsrRequest while there is an
>> > > > > > ongoing disk failure, which should be pretty rare and acceptable. This
>> > > > > > should simplify the design of this KIP.
>> > > > > >
>> > > > > > Regarding RAID-5, I think the concern with RAID-5/6 is not just about
>> > > > > > performance when there is no failure. For example, RAID-5 can support
>> > > > > > up to one disk failure, and it takes time to rebuild the disk after one
>> > > > > > disk failure. RAID 5 implementations are susceptible to system failures
>> > > > > > because of trends regarding array rebuild time and the chance of drive
>> > > > > > failure during rebuild. There is no such performance degradation for
>> > > > > > JBOD, and JBOD can survive multiple log directory failures without
>> > > > > > reducing the performance of the good log directories. Would this be a
>> > > > > > reasonable reason for using JBOD instead of RAID-5/6?
>> > > > > >
>> > > > > > Previously we discussed whether the broker should remove an offline
>> > > > > > replica from the replica fetcher thread. I still think it should,
>> > > > > > instead of printing a lot of errors in the log4j log. We can still let
>> > > > > > the controller send StopReplicaRequest to the broker. I am not sure I
>> > > > > > understand why allowing the broker to remove an offline replica from
>> > > > > > the fetcher thread would increase churn in the ISR. Do you think this
>> > > > > > is a concern with this approach?
>> > > > > >
>> > > > > > I have updated the KIP to remove the created flag from ZK and change
>> > > > > > the field name to isNewReplica. Can you check if there is any issue
>> > > > > > with the latest KIP? Thanks for your time!
>> > > > > >
>> > > > > > Regards,
>> > > > > > Dong
>> > > > > >
>> > > > > >
>> > > > > > On Sat, Feb 25, 2017 at 9:11 AM, Jun Rao <jun@confluent.io> wrote:
>> > > > > >
>> > > > > > > Hi, Dong,
>> > > > > > >
>> > > > > > > Thanks for the reply.
>> > > > > > >
>> > > > > > > Personally, I'd prefer not to write the created flag per replica in
>> > > > > > > ZK. Your suggestion of disabling replica creation if there is a bad
>> > > > > > > log directory on the broker could work. The only thing is that it
>> > > > > > > may delay the creation of new replicas. I was thinking that an
>> > > > > > > alternative is to extend LeaderAndIsrRequest by adding an
>> > > > > > > isNewReplica field per replica. That field will be set when a
>> > > > > > > replica is transitioning from the NewReplica state to the Online
>> > > > > > > state. Then, when a broker receives a LeaderAndIsrRequest, if a
>> > > > > > > replica is marked as a new replica, it will be created on a good log
>> > > > > > > directory, if not already present. Otherwise, it only creates the
>> > > > > > > replica if all log directories are good and the replica is not
>> > > > > > > already present. This way, we don't delay the processing of new
>> > > > > > > replicas in the common case.
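>> > > > > > >
>> > > > > > > As a small decision-rule sketch of that behavior (hypothetical names,
>> > > > > > > just to make the rule explicit): a replica flagged isNewReplica may be
>> > > > > > > created on any good log directory, while an unflagged replica is only
>> > > > > > > created when every log directory is good, so we never re-create a
>> > > > > > > replica whose data may live on the failed disk.
>> > > > > > >
>> > > > > > > public class ReplicaCreationRuleSketch {
>> > > > > > >     static boolean shouldCreate(boolean isNewReplica,
>> > > > > > >                                 boolean alreadyPresentOnGoodDir,
>> > > > > > >                                 boolean allLogDirsGood) {
>> > > > > > >         if (alreadyPresentOnGoodDir) {
>> > > > > > >             return false;        // replica already exists, nothing to create
>> > > > > > >         }
>> > > > > > >         if (isNewReplica) {
>> > > > > > >             return true;         // brand-new replica: any good log dir will do
>> > > > > > >         }
>> > > > > > >         return allLogDirsGood;   // existing replica: only create if no dir is bad
>> > > > > > >     }
>> > > > > > > }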
>> > > > > > >
>> > > > > > > I am ok with not persisting the offline replicas in ZK and just
>> > > > > > > discovering them through the LeaderAndIsrRequest. It handles the
>> > > > > > > cases when a broker starts up with bad log directories better. So,
>> > > > > > > the additional overhead of rediscovering the offline replicas is
>> > > > > > > justified.
>> > > > > > >
>> > > > > > >
>> > > > > > > Another high level question. The proposal rejected RAID5/6 since it
>> > > > > > > adds additional I/Os. The main issue with RAID5 is that to write a
>> > > > > > > block that doesn't match the RAID stripe size, we have to first read
>> > > > > > > the old parity to compute the new one, which increases the number of
>> > > > > > > I/Os (http://rickardnobel.se/raid-5-write-penalty/). I am wondering
>> > > > > > > if you have tested RAID5's performance by creating a file system
>> > > > > > > whose block size matches the RAID stripe size
>> > > > > > > (https://www.percona.com/blog/2011/12/16/setting-up-xfs-the-simple-edition/).
>> > > > > > > This way, writing a block doesn't require a read first. A large
>> > > > > > > block size may increase the amount of data writes, when the same
>> > > > > > > block has to be written to disk multiple times. However, this is
>> > > > > > > probably ok in Kafka's use case since we batch the I/O flush
>> > > > > > > already. As you can see, we will be adding some complexity to
>> > > > > > > support JBOD in Kafka one way or another. If we can tune the
>> > > > > > > performance of RAID5 to match that of RAID10, perhaps using RAID5 is
>> > > > > > > a simpler solution.
>> > > > > > >
>> > > > > > > Thanks,
>> > > > > > >
>> > > > > > > Jun
>> > > > > > >
>> > > > > > >
>> > > > > > > On Fri, Feb 24, 2017 at 10:17 AM, Dong Lin <lindong28@gmail.com> wrote:
>> > > > > > >
>> > > > > > > > Hey Jun,
>> > > > > > > >
>> > > > > > > > I don't think we should allow failed replicas to be re-created on
>> > > > > > > > the good disks. Say there are 2 disks and each of them is 51%
>> > > > > > > > loaded. If either disk fails and we allow its replicas to be
>> > > > > > > > re-created on the other disk, both disks will fail. Alternatively,
>> > > > > > > > we can disable replica creation if there is a bad disk on a broker.
>> > > > > > > > I personally think it is worth the additional complexity in the
>> > > > > > > > broker to store created replicas in ZK, so that we allow new
>> > > > > > > > replicas to be created on the broker even when there is a bad log
>> > > > > > > > directory. This approach won't add complexity in the controller.
>> > > > > > > > But I am fine with disabling replica creation when there is a bad
>> > > > > > > > log directory if that is the only blocking issue for this KIP.
>> > > > > > > >
>> > > > > > > > Whether we store created flags is independent of whether/how we
>> > > > > > > > store offline replicas. Per our previous discussion, do you think
>> > > > > > > > it is OK not to store offline replicas in ZK and to propagate the
>> > > > > > > > offline replicas from the broker to the controller via the
>> > > > > > > > LeaderAndIsrRequest?
>> > > > > > > >
>> > > > > > > > Thanks,
>> > > > > > > > Dong
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>
