kafka-dev mailing list archives

From Jun Rao <...@confluent.io>
Subject Re: [DISCUSS] KIP-112: Handle disk failure for JBOD
Date Fri, 24 Feb 2017 18:02:18 GMT
Hi, Dong,

Got it. Yes, in that case, we will just fall back to the corner case
handling: (1) failed replicas will be recreated on another available log
directory; (2) if later on, the failed log directory is fixed with the
existing data, then during broker startup we will detect the duplicate
replicas, fail the startup, and let the admin fix the issue manually. This
is not optimal, but I am not sure this is a common case.

Overall, I am just wondering about the need for storing a created flag for
each replica. It seems to be useful only for an uncommon case. The more
common case is that a log directory fails while the broker is running. The
tradeoff is that if we don't optimize for the uncommon case, we only need
to write failed replicas, instead of successfully created replicas, to ZK.
The former will be a lot less than the latter.

Thanks,

Jun

On Thu, Feb 23, 2017 at 11:17 PM, Dong Lin <lindong28@gmail.com> wrote:

> Hey Jun,
>
> I think there is one simpler design that doesn't need to add a "create"
> flag in LeaderAndIsrRequest and also removes the need for the controller
> to track/update which replicas are created. The idea is for each broker to
> persist the created replicas in a per-broker-per-topic znode. When a
> replica is created or deleted, the broker updates the znode accordingly.
> When a broker receives a LeaderAndIsrRequest, it learns the "create" flag
> from its cache of these znodes' data. When a broker starts, it does need
> to read a number of znodes proportional to the number of topics on its
> disks. But the controller still needs to learn about offline replicas from
> the LeaderAndIsrResponse.
>
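The bookkeeping described above can be sketched as follows; the znode path layout, class, and method names are illustrative assumptions, not part of the KIP:

```python
# Sketch of per-broker-per-topic created-replica bookkeeping. The path
# layout and names below are hypothetical, chosen only for illustration.

class CreatedReplicaCache:
    """Broker-side cache mirroring per-broker-per-topic 'created' znodes."""

    def __init__(self, broker_id):
        self.broker_id = broker_id
        self.created = {}  # topic -> set of created partition ids

    def znode_path(self, topic):
        # One znode per (broker, topic): a starting broker reads a number
        # of znodes proportional to the topics on its disks.
        return "/brokers/ids/%d/created-replicas/%s" % (self.broker_id, topic)

    def on_replica_created(self, topic, partition):
        # The broker updates the znode and its cache on replica creation.
        self.created.setdefault(topic, set()).add(partition)

    def on_replica_deleted(self, topic, partition):
        self.created.get(topic, set()).discard(partition)

    def should_create(self, topic, partition):
        # On a LeaderAndIsrRequest, the broker derives the "create"
        # decision from its cache instead of a flag from the controller.
        return partition not in self.created.get(topic, set())
```

The point of the sketch is that the controller never carries the flag; the broker alone reconciles its cache against incoming LeaderAndIsrRequests.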
> I think this is better than the current design. Do you have any concern
> with this design?
>
> Thanks,
> Dong
>
>
> On Thu, Feb 23, 2017 at 7:12 PM, Dong Lin <lindong28@gmail.com> wrote:
>
> > Hey Jun,
> >
> > Sure, here is my explanation.
> >
> > Design B would not work if it doesn't store created replicas in ZK.
> > For example, say broker B is healthy when it is shut down. At this
> > moment no offline replica is written in ZK for this broker. Suppose a
> > log directory is damaged while the broker is offline; then when this
> > broker starts, it won't know which replicas are in the bad log
> > directory. And it won't be able to specify those offline replicas in
> > /failed-log-directory either.
> >
> > Let's say design B stores created replicas in ZK. Then the next problem
> > is that, in the scenario where multiple log directories are damaged
> > while the broker is offline, when the broker starts, it won't be able to
> > know the exact list of offline replicas on each bad log directory. All
> > it knows is the set of offline replicas across all those bad log
> > directories. Thus it is impossible for the broker to specify offline
> > replicas per log directory in this scenario.
> >
> > I agree with your observation that, if the admin replaces dir1 with a
> > good empty disk but leaves dir2 untouched, design A won't create the
> > replica whereas design B can create it. But I am not sure that is a case
> > we want to optimize for. It seems reasonable for the admin to fix both
> > log directories in practice. If the admin fixes only one of the two log
> > directories, we can say it is a partial fix and Kafka won't re-create
> > any offline replicas on dir1 or dir2. Similar to the extra round of
> > LeaderAndIsrRequest in case of log failure, I think this is also a
> > pretty minor issue with design B.
> >
> > Thanks,
> > Dong
> >
> >
> > On Thu, Feb 23, 2017 at 6:46 PM, Jun Rao <jun@confluent.io> wrote:
> >
> >> Hi, Dong,
> >>
> >> My replies are inlined below.
> >>
> >> On Thu, Feb 23, 2017 at 4:47 PM, Dong Lin <lindong28@gmail.com> wrote:
> >>
> >> > Hey Jun,
> >> >
> >> > Thanks for your reply! Let me first comment on the things that you
> >> > listed as advantages of B over A.
> >> >
> >> > 1) No change in LeaderAndIsrRequest protocol.
> >> >
> >> > I agree with this.
> >> >
> >> > 2) Step 1. One less round of LeaderAndIsrRequest and no additional ZK
> >> > writes to record the created flag.
> >> >
> >> > I don't think this is true. There will be one round of
> >> > LeaderAndIsrRequest in both A and B. In design A the controller needs
> >> > to write to ZK once to record this replica as created. In design B
> >> > the broker needs to write to ZK once to record this replica as
> >> > created. So there is the same number of LeaderAndIsrRequests and ZK
> >> > writes.
> >> >
> >> > The broker needs to record created replicas in design B so that when
> >> > it bootstraps with a failed log directory, the broker can derive the
> >> > offline replicas as the difference between the created replicas and
> >> > the replicas found on good log directories.
> >> >
> >> >
> >> Design B actually doesn't write created replicas in ZK. When a broker
> >> starts up, all offline replicas are stored in the /failed-log-directory
> >> path in ZK. So if a replica is not there and is not in the live log
> >> directories either, it's never created. Does this work?
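The design-B rule described above reduces to a set difference; a minimal sketch, with inputs and the function name invented for illustration:

```python
def never_created(assigned, offline_in_zk, live_on_disk):
    """Sketch of the design-B rule above: a replica assigned to this broker
    was never created iff it appears neither under /failed-log-directory in
    ZK nor in any live log directory. All inputs are sets of
    (topic, partition) tuples; names here are illustrative assumptions."""
    return assigned - offline_in_zk - live_on_disk
```

Anything left in the result is safe to treat as never created, since it is recorded neither as offline nor as present on disk.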
> >>
> >>
> >>
> >> > 3) Step 2. One less round of LeaderAndIsrRequest and no additional
> >> > logic to handle LeaderAndIsrResponse.
> >> >
> >> > While I agree there is one less round of LeaderAndIsrRequest in
> >> > design B, I don't think one additional LeaderAndIsrRequest to handle
> >> > log directory failure is a big deal given that it doesn't happen
> >> > frequently.
> >> >
> >> > Also, while there is no additional logic to handle
> >> > LeaderAndIsrResponse in design B, I actually think this is something
> >> > that the controller should do anyway. Say a broker stops responding
> >> > to any requests without removing itself from ZooKeeper; the only way
> >> > for the controller to realize this and re-elect leaders is to send a
> >> > request to this broker and handle the response. It is a problem that
> >> > we don't do this as of now.
> >> >
> >> > 4) Step 6. Additional ZK reads proportional to # of failed log
> >> > directories, instead of # of partitions.
> >> >
> >> > If one znode is able to describe all topic partitions in a log
> >> > directory, then the existing znode /brokers/topics/[topic] should be
> >> > able to describe created replicas in addition to the assigned
> >> > replicas for every partition of the topic. In this case, design A
> >> > requires no additional ZK reads whereas design B requires ZK reads
> >> > proportional to the # of failed log directories.
> >> >
> >> > 5) Step 3. In design A, if a broker is restarted and the failed log
> >> > directory is unreadable, the broker doesn't know which replicas are
> >> > on the failed log directory. So, when the broker receives the
> >> > LeaderAndIsrRequest with created = false, it's a bit hard for the
> >> > broker to decide whether it should create the missing replica on
> >> > other log directories. This is easier in design B since the list of
> >> > failed replicas is persisted in ZK.
> >> >
> >> > I don't understand why it is hard for the broker to make this
> >> > decision in design A. With design A, if a broker is started with a
> >> > failed log directory and it receives a LeaderAndIsrRequest with
> >> > created=false for a replica that cannot be found on any good log
> >> > directory, the broker will not create this replica. Is there any
> >> > drawback with this approach?
> >> >
> >> >
> >> >
> >> I was thinking about this case. Suppose two log directories dir1 and
> >> dir2 fail. The admin replaces dir1 with an empty new disk. The broker
> >> is restarted with dir1 alive, and dir2 still failing. Now, when
> >> receiving a LeaderAndIsrRequest including a replica that was previously
> >> in dir1, the broker won't be able to create that replica when it could.
> >>
> >>
> >>
> >> > Here is my summary of the pros and cons of design B as compared to
> >> > design A.
> >> >
> >> > pros:
> >> >
> >> > 1) No change to LeaderAndIsrRequest.
> >> > 2) One less round of LeaderAndIsrRequest in case of log directory
> >> > failure.
> >> >
> >> > cons:
> >> >
> >> > 1) It is impossible for the broker to figure out the log directory
> >> > of offline replicas for failed-log-directory/[directory] if multiple
> >> > log directories are unreadable when the broker starts.
> >> >
> >> >
> >> Hmm, I am not sure that I get this point. If multiple log directories
> >> fail,
> >> design B stores each directory under /failed-log-directory, right?
> >>
> >> Thanks,
> >>
> >> Jun
> >>
> >>
> >>
> >> > 2) The znode size limit of failed-log-directory/[directory]
> >> > essentially limits the number of topic partitions that can exist on
> >> > a log directory. It becomes more of a problem when a broker is
> >> > configured to use multiple log directories, each of which is a
> >> > large-capacity RAID-10 array. While this may not be a problem in
> >> > practice with an additional requirement (e.g. don't use more than
> >> > one log directory if using RAID-10), ideally we want to avoid such a
> >> > limit.
> >> >
> >> > 3) An extra ZK read of failed-log-directory/[directory] when the
> >> > broker starts.
> >> >
> >> >
> >> > My main concern with design B is the use of the znode
> >> > /brokers/ids/[brokerId]/failed-log-directory/[directory]. I don't
> >> > really think the other pros/cons of design B matter to us. Does my
> >> > summary make sense?
> >> >
> >> > Thanks,
> >> > Dong
> >> >
> >> >
> >> > On Thu, Feb 23, 2017 at 2:20 PM, Jun Rao <jun@confluent.io> wrote:
> >> >
> >> > > Hi, Dong,
> >> > >
> >> > > Just so that we are on the same page, let me spec out the
> >> > > alternative design a bit more and then compare. Let's call the
> >> > > current design A and the alternative design B.
> >> > >
> >> > > Design B:
> >> > >
> >> > > New ZK path
> >> > > Failed log directory path (persistent): This is created by a broker
> >> > > when a log directory fails and is potentially removed when the
> >> > > broker is restarted.
> >> > > /brokers/ids/[brokerId]/failed-log-directory/directory1 => { json
> >> > > of the replicas in the log directory }.
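Since the KIP text only says "json of the replicas in the log directory", the exact schema is open; one plausible shape for the znode value, with the version field and key names assumed purely for illustration:

```python
import json

# Hypothetical value for /brokers/ids/1/failed-log-directory/directory1.
# The "version" field and the key names are assumptions; the KIP does not
# specify the JSON schema.
failed_dir_value = {
    "version": 1,
    "partitions": [
        {"topic": "t1", "partition": 0},
        {"topic": "t1", "partition": 2},
        {"topic": "t2", "partition": 1},
    ],
}
encoded = json.dumps(failed_dir_value)  # bytes written to the znode
```

A versioned JSON object mirrors how other Kafka znodes (e.g. broker registration) are laid out, which is why it is used in this sketch.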
> >> > >
> >> > > *1. Topic gets created*
> >> > > - Works the same as before.
> >> > >
> >> > > *2. A log directory stops working on a broker during runtime*
> >> > >
> >> > > - The controller watches the path /failed-log-directory for new
> >> > > znodes.
> >> > >
> >> > > - The broker detects an offline log directory during runtime and
> >> > > marks the affected replicas as offline in memory.
> >> > >
> >> > > - The broker writes the failed directory and all replicas in the
> >> > > failed directory under /failed-log-directory/directory1.
> >> > >
> >> > > - The controller reads /failed-log-directory/directory1 and stores
> >> > > in memory the list of replicas that failed due to disk failures.
> >> > >
> >> > > - The controller moves those failed replicas to the offline state
> >> > > and triggers the state change in the replica state machine.
> >> > >
> >> > >
> >> > > *3. Broker is restarted*
> >> > >
> >> > > - The broker reads /brokers/ids/[brokerId]/failed-log-directory,
> >> > > if any.
> >> > >
> >> > > - For each failed log directory it reads from ZK: if the log
> >> > > directory exists in log.dirs and is accessible now, or if the log
> >> > > directory no longer exists in log.dirs, the broker removes that log
> >> > > directory from failed-log-directory. Otherwise, the broker loads
> >> > > the replicas in the failed log directory into memory as offline.
> >> > >
> >> > > - The controller handles the failed log directory change event, if
> >> > > needed (same as #2).
> >> > >
> >> > > - The controller handles the broker registration event.
> >> > >
> >> > >
> >> > > *6. Controller failover*
> >> > > - The controller reads all child paths under /failed-log-directory
> >> > > to rebuild the list of replicas that failed due to disk failures.
> >> > > Those replicas will be transitioned to the offline state during
> >> > > controller initialization.
> >> > >
> >> > > Comparing this with design A, I think the following are the things
> >> > > that design B simplifies.
> >> > > * No change in the LeaderAndIsrRequest protocol.
> >> > > * Step 1. One less round of LeaderAndIsrRequest and no additional
> >> > > ZK writes to record the created flag.
> >> > > * Step 2. One less round of LeaderAndIsrRequest and no additional
> >> > > logic to handle LeaderAndIsrResponse.
> >> > > * Step 6. Additional ZK reads proportional to the # of failed log
> >> > > directories, instead of the # of partitions.
> >> > > * Step 3. In design A, if a broker is restarted and the failed log
> >> > > directory is unreadable, the broker doesn't know which replicas are
> >> > > on the failed log directory. So, when the broker receives the
> >> > > LeaderAndIsrRequest with created = false, it's a bit hard for the
> >> > > broker to decide whether it should create the missing replica on
> >> > > other log directories. This is easier in design B since the list of
> >> > > failed replicas is persisted in ZK.
> >> > >
> >> > > Now, for some of the other things that you mentioned.
> >> > >
> >> > > * What happens if a log directory is renamed?
> >> > > I think this can be handled in the same way as a non-existing log
> >> > > directory during broker restart.
> >> > >
> >> > > * What happens if replicas are moved manually across disks?
> >> > > Good point. Well, if all log directories are available, the failed
> >> > > log directory path will be cleared. In the rarer case that a log
> >> > > directory is still offline and one of the replicas registered in
> >> > > the failed log directory shows up in another available log
> >> > > directory, I am not quite sure. Perhaps the simplest approach is to
> >> > > just error out and let the admin fix things manually?
> >> > >
> >> > > Thanks,
> >> > >
> >> > > Jun
> >> > >
> >> > >
> >> > >
> >> > > On Wed, Feb 22, 2017 at 3:39 PM, Dong Lin <lindong28@gmail.com>
> >> > > wrote:
> >> > >
> >> > > > Hey Jun,
> >> > > >
> >> > > > Thanks much for the explanation. I have some questions about 21,
> >> > > > but that is less important than 20. 20 would require considerable
> >> > > > change to the KIP and probably requires weeks of discussion
> >> > > > again. Thus I would like to be very sure that we agree on the
> >> > > > problems with the current design that you mentioned, and that
> >> > > > there is no foreseeable problem with the alternative design.
> >> > > >
> >> > > > Please see my detailed responses below. To summarize my points,
> >> > > > I couldn't figure out any non-trivial drawback of the current
> >> > > > design as compared to the alternative design; and I couldn't
> >> > > > figure out a good way to store offline replicas in the
> >> > > > alternative design. Can you see if these points make sense?
> >> > > > Thanks in advance for your time!!
> >> > > >
> >> > > >
> >> > > > 1) The alternative design requires slightly more dependency on
> >> > > > ZK. While both solutions store created replicas in ZK, the
> >> > > > alternative design would also store offline replicas in ZK,
> >> > > > which the current design doesn't.
> >> > > >
> >> > > > 2) I am not sure that we should store offline replicas in the
> >> > > > znode /brokers/ids/[brokerId]/failed-log-directory/[directory].
> >> > > > We probably don't want to expose log directory paths in
> >> > > > ZooKeeper, based on the principle that we should only store
> >> > > > logical information (e.g. topic, brokerId) in ZooKeeper's path
> >> > > > names. More specifically, we probably don't want to rename a path
> >> > > > in ZooKeeper simply because the user renamed a log directory. And
> >> > > > we probably don't want to read/write these znodes just because
> >> > > > the user manually moved replicas between log directories.
> >> > > >
> >> > > > 3) I couldn't find a good way to store offline replicas in ZK in
> >> > > > the alternative design. We can store this information in one
> >> > > > znode per-topic, per-brokerId, or per-brokerId-topic. All these
> >> > > > choices have their own problems. If we store it in a per-topic
> >> > > > znode then multiple brokers may need to read/write offline
> >> > > > replicas in the same znode, which is generally bad. If we store
> >> > > > it per-brokerId then we effectively limit the maximum number of
> >> > > > topic-partitions that can be stored on a broker by the znode size
> >> > > > limit. This contradicts the idea of expanding a single broker's
> >> > > > capacity by throwing in more disks. If we store it
> >> > > > per-brokerId-topic, then when the controller starts, it needs to
> >> > > > read a number of znodes proportional to brokerId*topic, which may
> >> > > > double the overall znode reads during controller startup.
> >> > > >
> >> > > > 4) The alternative design is less efficient than the current
> >> > > > design in case of log directory failure. The alternative design
> >> > > > requires extra znode reads in order to read offline replicas from
> >> > > > ZK, while the current design requires only one pair of
> >> > > > LeaderAndIsrRequest and LeaderAndIsrResponse. The extra znode
> >> > > > reads will be proportional to the number of topics on the broker
> >> > > > if we store offline replicas per-brokerId-topic.
> >> > > >
> >> > > > 5) While I agree that the failure reporting should be done where
> >> > > > the failure originated, I think the current design is consistent
> >> > > > with what we are already doing. With the current design, the
> >> > > > broker will send a notification via ZooKeeper and the controller
> >> > > > will send a LeaderAndIsrRequest to the broker. This is similar to
> >> > > > how the broker sends an ISR change notification and the
> >> > > > controller reads the latest ISR. If we do want the broker to
> >> > > > report the failure directly to the controller, we should probably
> >> > > > have the broker send an RPC directly to the controller, as it
> >> > > > does for ControllerShutdownRequest. I can do this as well.
> >> > > >
> >> > > > 6) I don't think the current design requires additional state
> >> > > > management in each of the existing state handling paths such as
> >> > > > topic creation or controller failover. All this existing logic
> >> > > > should stay exactly the same except that the controller should
> >> > > > recognize offline replicas on live brokers instead of assuming
> >> > > > all replicas on live brokers are live. But this additional change
> >> > > > is required in both the current design and the alternative
> >> > > > design. Thus there should be no difference between the current
> >> > > > design and the alternative design with respect to this existing
> >> > > > state handling logic in the controller.
> >> > > >
> >> > > > 7) While I agree that the current design requires additional
> >> > > > complexity in the controller in order to handle
> >> > > > LeaderAndIsrResponse and potentially change partition and replica
> >> > > > states to offline in the state machines, I think such logic is
> >> > > > necessary in a well-designed controller, either with the
> >> > > > alternative design or even without JBOD. The controller should be
> >> > > > able to handle errors (e.g. ClusterAuthorizationException) in
> >> > > > LeaderAndIsrResponse, and in every response in general. For
> >> > > > example, if the controller hasn't received a LeaderAndIsrResponse
> >> > > > after a given period of time, it probably means the broker has
> >> > > > hung, and the controller should consider this broker offline and
> >> > > > re-elect leaders from other brokers. This would actually fix some
> >> > > > problems we have seen before at LinkedIn, where a broker hangs
> >> > > > due to RAID-controller failure. In other words, I think it is a
> >> > > > good idea for the controller to handle responses.
> >> > > >
> >> > > > 8) I am not sure that the additional state management to handle
> >> > > > LeaderAndIsrResponse causes new types of synchronization. It is
> >> > > > true that the logic is not handled by the ZK event handling
> >> > > > thread. But the existing ControllerShutdownRequest is also not
> >> > > > handled by the ZK event handling thread. The LeaderAndIsrResponse
> >> > > > can be handled by the same thread that currently handles
> >> > > > ControllerShutdownRequest so that we don't require a new type of
> >> > > > synchronization. Further, it should be rare to require additional
> >> > > > synchronization to handle LeaderAndIsrResponse, because we only
> >> > > > need synchronization when there are offline replicas.
> >> > > >
> >> > > >
> >> > > > Thanks,
> >> > > > Dong
> >> > > >
> >> > > > On Wed, Feb 22, 2017 at 10:36 AM, Jun Rao <jun@confluent.io>
> >> > > > wrote:
> >> > > >
> >> > > > > Hi, Dong, Jiangjie,
> >> > > > >
> >> > > > > 20. (1) I agree that ideally we'd like to use direct RPC for
> >> > > > > broker-to-broker communication instead of ZK. However, in the
> >> > > > > alternative design, the failed log directory path also serves
> >> > > > > as the persistent state for remembering the offline partitions.
> >> > > > > This is similar to the controller_managed_state path in your
> >> > > > > design. The difference is that the alternative design stores
> >> > > > > the state in fewer ZK paths, which helps reduce the controller
> >> > > > > failover time. (2) I agree that we want the controller to be
> >> > > > > the single place where decisions are made. However,
> >> > > > > intuitively, the failure reporting should be done where the
> >> > > > > failure originated. For example, if a broker fails, the broker
> >> > > > > reports the failure by de-registering from ZK. The failed log
> >> > > > > directory path is similar in that regard. (3) I am not worried
> >> > > > > about the additional load from extra LeaderAndIsrRequests. What
> >> > > > > I worry about is any unnecessary additional complexity in the
> >> > > > > controller. To me, the additional complexity in the current
> >> > > > > design is the additional state management in each of the
> >> > > > > existing state handling paths (e.g., topic creation, controller
> >> > > > > failover, etc), and the additional synchronization, since the
> >> > > > > additional state management is not initiated from the ZK event
> >> > > > > handling thread.
> >> > > > >
> >> > > > > 21. One of the reasons that we need to send a
> >> > > > > StopReplicaRequest to an offline replica is to handle
> >> > > > > controlled shutdown. In that case, a broker is still alive, but
> >> > > > > indicates to the controller that it plans to shut down. Being
> >> > > > > able to stop the replica in the shutting-down broker reduces
> >> > > > > churn in the ISR. So, for simplicity, it's probably easier to
> >> > > > > always send a StopReplicaRequest to any offline replica.
> >> > > > >
> >> > > > > Thanks,
> >> > > > >
> >> > > > > Jun
> >> > > > >
> >> > > > >
> >> > > > > On Tue, Feb 21, 2017 at 2:37 PM, Dong Lin <lindong28@gmail.com>
> >> > > > > wrote:
> >> > > > >
> >> > > > > > Hey Jun,
> >> > > > > >
> >> > > > > > Thanks much for your comments.
> >> > > > > >
> >> > > > > > I actually proposed a design that stores both offline
> >> > > > > > replicas and created replicas in per-broker znodes before
> >> > > > > > switching to the design in the current KIP. The current
> >> > > > > > design stores created replicas in per-partition znodes and
> >> > > > > > transmits offline replicas via LeaderAndIsrResponse. The
> >> > > > > > original solution is roughly the same as what you suggested.
> >> > > > > > The advantage of the current solution is kind of
> >> > > > > > philosophical: 1) we want to transmit data (e.g. offline
> >> > > > > > replicas) using RPC and reduce dependency on ZooKeeper; 2) we
> >> > > > > > want the controller to be the only one that determines any
> >> > > > > > state (e.g. offline replicas) that will be exposed to the
> >> > > > > > user. The advantage of the solution that stores offline
> >> > > > > > replicas in ZooKeeper is that we can save one round-trip for
> >> > > > > > the controller to handle log directory failure. However, this
> >> > > > > > extra round-trip should not be a big deal, since log
> >> > > > > > directory failure is rare and the inefficiency of extra
> >> > > > > > latency is less of a problem when there is a log directory
> >> > > > > > failure.
> >> > > > > >
> >> > > > > > Do you think the two philosophical advantages of the current
> >> > > > > > KIP make sense? If not, then I can switch to the original
> >> > > > > > design that stores offline replicas in ZooKeeper. It is
> >> > > > > > actually written already. One disadvantage is that we have to
> >> > > > > > make non-trivial changes to the KIP (e.g. no create flag in
> >> > > > > > LeaderAndIsrRequest and no created flag in ZooKeeper) and
> >> > > > > > restart this KIP discussion.
> >> > > > > >
> >> > > > > > Regarding 21, it seems to me that LeaderAndIsrRequest/
> >> > > > > > StopReplicaRequest only makes sense when the broker can make
> >> > > > > > the choice (e.g. whether or not to fetch data for this
> >> > > > > > replica). In the case that the log directory of the replica
> >> > > > > > is already offline, the broker has to stop fetching data for
> >> > > > > > this replica regardless of what the controller tells it to
> >> > > > > > do. Thus it seems cleaner for the broker to stop fetching
> >> > > > > > data for this replica immediately. The advantage of this
> >> > > > > > solution is that the controller logic is simpler, since it
> >> > > > > > doesn't need to send a StopReplicaRequest in case of log
> >> > > > > > directory failure, and the log4j log is also cleaner. Is
> >> > > > > > there a specific advantage of having the controller tell the
> >> > > > > > broker to stop fetching data for offline replicas?
> >> > > > > >
> >> > > > > > Regarding 22, I agree with your observation that it can
> >> > > > > > happen. I will update the KIP and specify that the broker
> >> > > > > > will exit with a proper error message in the log, and that
> >> > > > > > the user needs to manually remove the duplicate partitions
> >> > > > > > and restart the broker.
> >> > > > > >
> >> > > > > > Thanks!
> >> > > > > > Dong
> >> > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > > On Mon, Feb 20, 2017 at 10:17 PM, Jun Rao <jun@confluent.io>
> >> > > > > > wrote:
> >> > > > > >
> >> > > > > > > Hi, Dong,
> >> > > > > > >
> >> > > > > > > Sorry for the delay. A few more comments.
> >> > > > > > >
> >> > > > > > > 20. One complexity that I found in the current KIP is that
> >> > > > > > > the way the broker communicates failed replicas to the
> >> > > > > > > controller is inefficient. When a log directory fails, the
> >> > > > > > > broker only sends an indication through ZK to the
> >> > > > > > > controller, and the controller has to issue a
> >> > > > > > > LeaderAndIsrRequest to discover which replicas are offline
> >> > > > > > > due to the log directory failure. An alternative approach
> >> > > > > > > is that when a log directory fails, the broker just writes
> >> > > > > > > the failed directory and the corresponding topic partitions
> >> > > > > > > in a new failed log directory ZK path like the following.
> >> > > > > > >
> >> > > > > > > Failed log directory path:
> >> > > > > > > /brokers/ids/[brokerId]/failed-log-directory/directory1 =>
> >> > > > > > > { json of the topic partitions in the log directory }.
> >> > > > > > >
> >> > > > > > > The controller just watches for child changes in
> >> > > > > > > /brokers/ids/[brokerId]/failed-log-directory.
> >> > > > > > > After reading this path, the controller knows the exact set
> >> > > > > > > of replicas that are offline and can trigger the replica
> >> > > > > > > state changes accordingly. This saves an extra round of
> >> > > > > > > LeaderAndIsrRequest handling.
> >> > > > > > >
> >> > > > > > > With this new ZK path, we can probably get rid of
> >> > > > > > > /brokers/topics/[topic]/partitions/[partitionId]/controller_managed_state.
> >> > > > > > > The creation of a new replica is expected to always succeed
> >> > > > > > > unless all log directories fail, in which case the broker
> >> > > > > > > goes down anyway. Then, during controller failover, the
> >> > > > > > > controller just needs to additionally read from ZK the
> >> > > > > > > extra failed log directory paths, which are many fewer than
> >> > > > > > > topics or partitions.
> >> > > > > > >
> >> > > > > > > On broker startup, if a log directory becomes available,
> >> > > > > > > the corresponding log directory path in ZK will be removed.
> >> > > > > > >
> >> > > > > > > The downside of this approach is that the value of this new
> >> > > > > > > ZK path can be large. However, even with 5K partitions per
> >> > > > > > > log directory and 100 bytes per partition, the size of the
> >> > > > > > > value is 500KB, still less than the default 1MB znode limit
> >> > > > > > > in ZK.
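The size estimate above checks out directly; the 1MB figure corresponds to ZooKeeper's default jute.maxbuffer limit (the 100 bytes per partition is the email's own rough assumption):

```python
# Back-of-the-envelope check of the znode size estimate above.
partitions_per_log_dir = 5_000
bytes_per_partition = 100          # rough per-partition JSON overhead (assumed)
znode_limit = 1 * 1024 * 1024      # ZooKeeper's default ~1MB jute.maxbuffer

value_size = partitions_per_log_dir * bytes_per_partition
assert value_size == 500_000       # 500KB
assert value_size < znode_limit    # fits under the default znode limit
```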
> >> > > > > > >
> >> > > > > > > 21. "Broker will remove offline replica from its replica
> >> > > > > > > fetcher threads." The proposal lets the broker remove the
> >> > > > > > > replica from the replica fetcher thread when it detects a
> >> > > > > > > directory failure. An alternative is to only do that when
> >> > > > > > > the broker receives the LeaderAndIsrRequest/
> >> > > > > > > StopReplicaRequest. The benefit of this is that the
> >> > > > > > > controller is the only one who decides which replicas are
> >> > > > > > > removed from the replica fetcher threads. The broker also
> >> > > > > > > doesn't need additional logic to remove the replica from
> >> > > > > > > the replica fetcher threads. The downside is that, in a
> >> > > > > > > small window, the replica fetcher thread will keep writing
> >> > > > > > > to the failed log directory and may pollute the log4j log.
> >> > > > > > >
> >> > > > > > > 22. In the current design, there is a potential corner case
> >> > > > > > > issue where the same partition may exist in more than one
> >> > > > > > > log directory at some point. Consider the following steps:
> >> > > > > > > (1) a new topic t1 is created and the controller sends
> >> > > > > > > LeaderAndIsrRequest to a broker; (2) the broker creates
> >> > > > > > > partition t1-p1 in log dir1; (3) before the broker sends a
> >> > > > > > > response, it goes down; (4) the broker is restarted with log
> >> > > > > > > dir1 unreadable; (5) the broker receives a new
> >> > > > > > > LeaderAndIsrRequest and creates partition t1-p1 on log dir2;
> >> > > > > > > (6) at some point, the broker is restarted with log dir1
> >> > > > > > > fixed. Now partition t1-p1 is in two log dirs. The
> >> > > > > > > alternative approach that I suggested above may suffer from
> >> > > > > > > a similar corner case issue. Since this is rare, if the
> >> > > > > > > broker detects this during broker startup, it can probably
> >> > > > > > > just log an error and exit. The admin can remove the
> >> > > > > > > redundant partitions manually and then restart the broker.
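A minimal sketch of the startup check described above: scan every log directory for topic-partition subdirectories and report any partition that appears in more than one log dir. The layout (one subdirectory per partition under each log dir, e.g. `t1-1`) follows Kafka's on-disk convention, but the function name and error handling are illustrative only, not the actual broker code.

```python
import os
from collections import defaultdict

def find_duplicate_partitions(log_dirs):
    """Map each partition dir name (e.g. 't1-1') to the log dirs holding
    it, and return only those present in more than one log dir."""
    locations = defaultdict(list)
    for log_dir in log_dirs:
        if not os.path.isdir(log_dir):
            continue  # skip offline or missing log directories
        for entry in sorted(os.listdir(log_dir)):
            if os.path.isdir(os.path.join(log_dir, entry)):
                locations[entry].append(log_dir)
    return {p: dirs for p, dirs in locations.items() if len(dirs) > 1}

# On startup the broker would log an error and exit if this returns a
# non-empty map, leaving the admin to delete the redundant copies.
```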
> >> > > > > > >
> >> > > > > > > Thanks,
> >> > > > > > >
> >> > > > > > > Jun
> >> > > > > > >
> >> > > > > > > On Sat, Feb 18, 2017 at 9:31 PM, Dong Lin <
> >> lindong28@gmail.com>
> >> > > > wrote:
> >> > > > > > >
> >> > > > > > > > Hey Jun,
> >> > > > > > > >
> >> > > > > > > > Could you please let me know if the solutions above could
> >> > address
> >> > > > > your
> >> > > > > > > > concern? I really want to move the discussion forward.
> >> > > > > > > >
> >> > > > > > > > Thanks,
> >> > > > > > > > Dong
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > On Tue, Feb 14, 2017 at 8:17 PM, Dong Lin <
> >> lindong28@gmail.com
> >> > >
> >> > > > > wrote:
> >> > > > > > > >
> >> > > > > > > > > Hey Jun,
> >> > > > > > > > >
> >> > > > > > > > > Thanks for all your help and time to discuss this KIP.
> >> > > > > > > > > When you get the time, could you let me know if the
> >> > > > > > > > > previous answers address the concern?
> >> > > > > > > > >
> >> > > > > > > > > I think the more interesting question in your last email
> >> > > > > > > > > is where we should store the "created" flag in ZK. I
> >> > > > > > > > > proposed the solution that I like most, i.e. store it
> >> > > > > > > > > together with the replica assignment data in the
> >> > > > > > > > > /brokers/topics/[topic] znode. In order to expedite
> >> > > > > > > > > discussion, let me provide another two ideas to address
> >> > > > > > > > > the concern just in case the first idea doesn't work:
> >> > > > > > > > >
> >> > > > > > > > > - We can avoid the extra controller ZK read when there
> >> > > > > > > > > is no disk failure (95% of the time?). When the
> >> > > > > > > > > controller starts, it doesn't read
> >> > > > > > > > > controller_managed_state in ZK and sends
> >> > > > > > > > > LeaderAndIsrRequest with "create = false". Only if the
> >> > > > > > > > > LeaderAndIsrResponse shows failure for any replica will
> >> > > > > > > > > the controller read controller_managed_state for this
> >> > > > > > > > > partition and re-send LeaderAndIsrRequest with
> >> > > > > > > > > "create = true" if this replica has not been created.
> >> > > > > > > > >
> >> > > > > > > > > - We can significantly reduce this ZK read time by
> >> > > > > > > > > making controller_managed_state topic-level information
> >> > > > > > > > > in ZK, e.g. /brokers/topics/[topic]/state. Given that
> >> > > > > > > > > most topics have 10+ partitions, the extra ZK read time
> >> > > > > > > > > should be less than 10% of the existing total ZK read
> >> > > > > > > > > time during controller failover.
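The first idea above (send "create = false" optimistically and only read ZK on failure) can be sketched roughly as follows. Everything here is illustrative — the real logic would live in the controller's Scala code and use real RPCs — but it shows that the controller_managed_state ZK read happens only on the rare failure path.

```python
class FakeBroker:
    """Stand-in for a broker handling LeaderAndIsrRequest: a replica that
    was never created fails the request unless create=True is set."""
    def __init__(self):
        self.created = set()

    def leader_and_isr(self, partition, create):
        if create:
            self.created.add(partition)
        return partition in self.created  # True = success

def start_partition(broker, zk_created_replicas, partition, zk_reads=0):
    # Optimistic path: send create=False without touching ZK.
    if broker.leader_and_isr(partition, create=False):
        return zk_reads
    # Failure path (rare): one ZK read of controller_managed_state,
    # then re-send with create=True if the replica was never created.
    zk_reads += 1
    if partition not in zk_created_replicas:
        broker.leader_and_isr(partition, create=True)
    return zk_reads
```

In the common case (replica already created) the controller pays no extra ZK read; only a failed LeaderAndIsrResponse triggers one.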
> >> > > > > > > > >
> >> > > > > > > > > Thanks!
> >> > > > > > > > > Dong
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > On Tue, Feb 14, 2017 at 7:30 AM, Dong Lin <
> >> > lindong28@gmail.com
> >> > > >
> >> > > > > > wrote:
> >> > > > > > > > >
> >> > > > > > > > >> Hey Jun,
> >> > > > > > > > >>
> >> > > > > > > > >> I just realized that you may be suggesting that a tool
> >> > > > > > > > >> for listing offline directories is necessary for
> >> > > > > > > > >> KIP-112 by asking whether KIP-112 and KIP-113 will be
> >> > > > > > > > >> in the same release. I think such a tool is useful but
> >> > > > > > > > >> doesn't have to be included in KIP-112. This is because
> >> > > > > > > > >> as of now the admin needs to log into the broker
> >> > > > > > > > >> machine and check the broker log to figure out the
> >> > > > > > > > >> cause of broker failure and the bad log directory in
> >> > > > > > > > >> case of disk failure. KIP-112 won't make it harder
> >> > > > > > > > >> since the admin can still figure out the bad log
> >> > > > > > > > >> directory by doing the same thing. Thus it is probably
> >> > > > > > > > >> OK to just include this script in KIP-113. Regardless,
> >> > > > > > > > >> my hope is to finish both KIPs ASAP and make them in
> >> > > > > > > > >> the same release since both KIPs are needed for the
> >> > > > > > > > >> JBOD setup.
> >> > > > > > > > >>
> >> > > > > > > > >> Thanks,
> >> > > > > > > > >> Dong
> >> > > > > > > > >>
> >> > > > > > > > >> On Mon, Feb 13, 2017 at 5:52 PM, Dong Lin <
> >> > > lindong28@gmail.com>
> >> > > > > > > wrote:
> >> > > > > > > > >>
> >> > > > > > > > >>> And the test plan has also been updated to simulate
> disk
> >> > > > failure
> >> > > > > by
> >> > > > > > > > >>> changing log directory permission to 000.
> >> > > > > > > > >>>
> >> > > > > > > > >>> On Mon, Feb 13, 2017 at 5:50 PM, Dong Lin <
> >> > > lindong28@gmail.com
> >> > > > >
> >> > > > > > > wrote:
> >> > > > > > > > >>>
> >> > > > > > > > >>>> Hi Jun,
> >> > > > > > > > >>>>
> >> > > > > > > > >>>> Thanks for the reply. These comments are very
> helpful.
> >> Let
> >> > > me
> >> > > > > > answer
> >> > > > > > > > >>>> them inline.
> >> > > > > > > > >>>>
> >> > > > > > > > >>>>
> >> > > > > > > > >>>> On Mon, Feb 13, 2017 at 3:25 PM, Jun Rao <
> >> > jun@confluent.io>
> >> > > > > > wrote:
> >> > > > > > > > >>>>
> >> > > > > > > > >>>>> Hi, Dong,
> >> > > > > > > > >>>>>
> >> > > > > > > > >>>>> Thanks for the reply. A few more replies and new
> >> comments
> >> > > > > below.
> >> > > > > > > > >>>>>
> >> > > > > > > > >>>>> On Fri, Feb 10, 2017 at 4:27 PM, Dong Lin <
> >> > > > lindong28@gmail.com
> >> > > > > >
> >> > > > > > > > wrote:
> >> > > > > > > > >>>>>
> >> > > > > > > > >>>>> > Hi Jun,
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> > Thanks for the detailed comments. Please see
> answers
> >> > > > inline:
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> > On Fri, Feb 10, 2017 at 3:08 PM, Jun Rao <
> >> > > jun@confluent.io
> >> > > > >
> >> > > > > > > wrote:
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> > > Hi, Dong,
> >> > > > > > > > >>>>> > >
> >> > > > > > > > >>>>> > > Thanks for the updated wiki. A few comments
> below.
> >> > > > > > > > >>>>> > >
> >> > > > > > > > >>>>> > > 1. Topics get created
> >> > > > > > > > >>>>> > > 1.1 Instead of storing successfully created
> >> replicas
> >> > in
> >> > > > ZK,
> >> > > > > > > could
> >> > > > > > > > >>>>> we
> >> > > > > > > > >>>>> > store
> >> > > > > > > > >>>>> > > unsuccessfully created replicas in ZK? Since the
> >> > latter
> >> > > > is
> >> > > > > > less
> >> > > > > > > > >>>>> common,
> >> > > > > > > > >>>>> > it
> >> > > > > > > > >>>>> > > probably reduces the load on ZK.
> >> > > > > > > > >>>>> > >
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> > We can store unsuccessfully created replicas in
> >> > > > > > > > >>>>> > ZK. But I am not sure if that can reduce write
> >> > > > > > > > >>>>> > load on ZK.
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> > If we want to reduce write load on ZK by storing
> >> > > > > > > > >>>>> > unsuccessfully created replicas in ZK, then the
> >> > > > > > > > >>>>> > broker should not write to ZK if all replicas are
> >> > > > > > > > >>>>> > successfully created. It means that if
> >> > > > > > > > >>>>> > /brokers/topics/[topic]/partitions/[partitionId]/controller_managed_state
> >> > > > > > > > >>>>> > doesn't exist in ZK for a given partition, we have
> >> > > > > > > > >>>>> > to assume all replicas of this partition have been
> >> > > > > > > > >>>>> > successfully created and send LeaderAndIsrRequest
> >> > > > > > > > >>>>> > with create = false. This becomes a problem if the
> >> > > > > > > > >>>>> > controller crashes before receiving the
> >> > > > > > > > >>>>> > LeaderAndIsrResponse to validate whether a replica
> >> > > > > > > > >>>>> > has been created.
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> > I think this approach can reduce the number of
> >> > > > > > > > >>>>> > bytes stored in ZK. But I am not sure if this is a
> >> > > > > > > > >>>>> > concern.
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> I was mostly concerned about the controller failover
> >> > time.
> >> > > > > > > Currently,
> >> > > > > > > > >>>>> the
> >> > > > > > > > >>>>> controller failover is likely dominated by the cost
> of
> >> > > > reading
> >> > > > > > > > >>>>> topic/partition level information from ZK. If we add
> >> > > another
> >> > > > > > > > partition
> >> > > > > > > > >>>>> level path in ZK, it probably will double the
> >> controller
> >> > > > > failover
> >> > > > > > > > >>>>> time. If
> >> > > > > > > > >>>>> the approach of representing the non-created
> replicas
> >> > > doesn't
> >> > > > > > work,
> >> > > > > > > > >>>>> have
> >> > > > > > > > >>>>> you considered just adding the created flag in the
> >> > > > leaderAndIsr
> >> > > > > > > path
> >> > > > > > > > >>>>> in ZK?
> >> > > > > > > > >>>>>
> >> > > > > > > > >>>>>
> >> > > > > > > > >>>> Yes, I have considered adding the created flag in the
> >> > > > > leaderAndIsr
> >> > > > > > > > path
> >> > > > > > > > >>>> in ZK. If we were to add created flag per replica in
> >> the
> >> > > > > > > > >>>> LeaderAndIsrRequest, then it requires a lot of change
> >> in
> >> > the
> >> > > > > code
> >> > > > > > > > base.
> >> > > > > > > > >>>>
> >> > > > > > > > >>>> If we don't add created flag per replica in the
> >> > > > > > LeaderAndIsrRequest,
> >> > > > > > > > >>>> then the information in leaderAndIsr path in ZK and
> >> > > > > > > > LeaderAndIsrRequest
> >> > > > > > > > >>>> would be different. Further, the procedure for broker
> >> to
> >> > > > update
> >> > > > > > ISR
> >> > > > > > > > in ZK
> >> > > > > > > > >>>> will be a bit complicated. When leader updates
> >> > leaderAndIsr
> >> > > > path
> >> > > > > > in
> >> > > > > > > > ZK, it
> >> > > > > > > > >>>> will have to first read created flags from ZK, change
> >> isr,
> >> > > and
> >> > > > > > write
> >> > > > > > > > >>>> leaderAndIsr back to ZK. And it needs to check znode
> >> > version
> >> > > > and
> >> > > > > > > > re-try
> >> > > > > > > > >>>> write operation in ZK if controller has updated ZK
> >> during
> >> > > this
> >> > > > > > > > period. This
> >> > > > > > > > >>>> is in contrast to the current implementation where
> >> > > > > > > > >>>> the leader either gets all the information from the
> >> > > > > > > > >>>> LeaderAndIsrRequest sent by the controller, or
> >> > > > > > > > >>>> determines the information by itself (e.g. ISR),
> >> > > > > > > > >>>> before writing to the leaderAndIsr path in ZK.
> >> > > > > > > > >>>>
> >> > > > > > > > >>>> It seems to me that the above solution is a bit
> >> > complicated
> >> > > > and
> >> > > > > > not
> >> > > > > > > > >>>> clean. Thus I come up with the design in this KIP to
> >> store
> >> > > > this
> >> > > > > > > > created
> >> > > > > > > > >>>> flag in a separate zk path. The path is named
> >> > > > > > > > controller_managed_state to
> >> > > > > > > > >>>> indicate that we can store in this znode all
> >> information
> >> > > that
> >> > > > is
> >> > > > > > > > managed by
> >> > > > > > > > >>>> controller only, as opposed to ISR.
> >> > > > > > > > >>>>
> >> > > > > > > > >>>> I agree with your concern of increased ZK read time
> >> during
> >> > > > > > > controller
> >> > > > > > > > >>>> failover. How about we store the "created"
> information
> >> in
> >> > > the
> >> > > > > > > > >>>> znode /brokers/topics/[topic]? We can change that
> >> znode to
> >> > > > have
> >> > > > > > the
> >> > > > > > > > >>>> following data format:
> >> > > > > > > > >>>>
> >> > > > > > > > >>>> {
> >> > > > > > > > >>>>   "version" : 2,
> >> > > > > > > > >>>>   "created" : {
> >> > > > > > > > >>>>     "1" : [1, 2, 3],
> >> > > > > > > > >>>>     ...
> >> > > > > > > > >>>>   },
> >> > > > > > > > >>>>   "partition" : {
> >> > > > > > > > >>>>     "1" : [1, 2, 3],
> >> > > > > > > > >>>>     ...
> >> > > > > > > > >>>>   }
> >> > > > > > > > >>>> }
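To make the proposed znode format concrete, here is a small sketch that derives the not-yet-created replicas from such a payload. The field names follow the JSON sketch above ("partition" maps partition id to assigned replicas, "created" maps partition id to replicas confirmed as created); the helper itself is illustrative only.

```python
import json

# Example version-2 payload for /brokers/topics/[topic], following the
# format sketched in this thread: replica 3 of partition "1" is
# assigned but not yet confirmed as created.
payload = json.dumps({
    "version": 2,
    "created": {"1": [1, 2]},
    "partition": {"1": [1, 2, 3]},
})

def uncreated_replicas(znode_bytes):
    """Replicas that are assigned but not yet confirmed as created."""
    data = json.loads(znode_bytes)
    missing = {}
    for pid, assigned in data["partition"].items():
        created = set(data.get("created", {}).get(pid, []))
        not_created = [r for r in assigned if r not in created]
        if not_created:
            missing[pid] = not_created
    return missing

print(uncreated_replicas(payload))  # {'1': [3]}
```

The controller could use such a view to decide which replicas still need a LeaderAndIsrRequest with "create = true", without a separate partition-level znode.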
> >> > > > > > > > >>>>
> >> > > > > > > > >>>> We won't have extra zk read using this solution. It
> >> also
> >> > > seems
> >> > > > > > > > >>>> reasonable to put the partition assignment
> information
> >> > > > together
> >> > > > > > with
> >> > > > > > > > >>>> replica creation information. The latter is only
> >> changed
> >> > > once
> >> > > > > > after
> >> > > > > > > > the
> >> > > > > > > > >>>> partition is created or re-assigned.
> >> > > > > > > > >>>>
> >> > > > > > > > >>>>
> >> > > > > > > > >>>>>
> >> > > > > > > > >>>>>
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> > > 1.2 If an error is received for a follower, does
> >> the
> >> > > > > > controller
> >> > > > > > > > >>>>> eagerly
> >> > > > > > > > >>>>> > > remove it from ISR or do we just let the leader
> >> > removes
> >> > > > it
> >> > > > > > > after
> >> > > > > > > > >>>>> timeout?
> >> > > > > > > > >>>>> > >
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> > No, Controller will not actively remove it from
> ISR.
> >> > But
> >> > > > > > > controller
> >> > > > > > > > >>>>> will
> >> > > > > > > > >>>>> > recognize it as offline replica and propagate this
> >> > > > > information
> >> > > > > > to
> >> > > > > > > > all
> >> > > > > > > > >>>>> > brokers via UpdateMetadataRequest. Each leader can
> >> use
> >> > > this
> >> > > > > > > > >>>>> information to
> >> > > > > > > > >>>>> > actively remove offline replica from ISR set. I
> have
> >> > > > updated
> >> > > > > to
> >> > > > > > > > wiki
> >> > > > > > > > >>>>> to
> >> > > > > > > > >>>>> > clarify it.
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>>
> >> > > > > > > > >>>>> That seems inconsistent with how the controller
> deals
> >> > with
> >> > > > > > offline
> >> > > > > > > > >>>>> replicas
> >> > > > > > > > >>>>> due to broker failures. When that happens, the
> broker
> >> > will
> >> > > > (1)
> >> > > > > > > select
> >> > > > > > > > >>>>> a new
> >> > > > > > > > >>>>> leader if the offline replica is the leader; (2)
> >> remove
> >> > the
> >> > > > > > replica
> >> > > > > > > > >>>>> from
> >> > > > > > > > >>>>> ISR if the offline replica is the follower. So,
> >> > > intuitively,
> >> > > > it
> >> > > > > > > seems
> >> > > > > > > > >>>>> that
> >> > > > > > > > >>>>> we should be doing the same thing when dealing with
> >> > offline
> >> > > > > > > replicas
> >> > > > > > > > >>>>> due to
> >> > > > > > > > >>>>> disk failure.
> >> > > > > > > > >>>>>
> >> > > > > > > > >>>>
> >> > > > > > > > >>>> My bad. I misunderstood how the controller currently
> >> > handles
> >> > > > > > broker
> >> > > > > > > > >>>> failure and ISR change. Yes we should do the same
> thing
> >> > when
> >> > > > > > dealing
> >> > > > > > > > with
> >> > > > > > > > >>>> offline replicas here. I have updated the KIP to
> >> specify
> >> > > that,
> >> > > > > > when
> >> > > > > > > an
> >> > > > > > > > >>>> offline replica is discovered by controller, the
> >> > controller
> >> > > > > > removes
> >> > > > > > > > offline
> >> > > > > > > > >>>> replicas from ISR in the ZK and sends
> >> LeaderAndIsrRequest
> >> > > with
> >> > > > > > > > updated ISR
> >> > > > > > > > >>>> to be used by partition leaders.
> >> > > > > > > > >>>>
> >> > > > > > > > >>>>
> >> > > > > > > > >>>>>
> >> > > > > > > > >>>>>
> >> > > > > > > > >>>>>
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> > > 1.3 Similar, if an error is received for a
> leader,
> >> > > should
> >> > > > > the
> >> > > > > > > > >>>>> controller
> >> > > > > > > > >>>>> > > trigger leader election again?
> >> > > > > > > > >>>>> > >
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> > Yes, controller will trigger leader election if
> >> leader
> >> > > > > replica
> >> > > > > > is
> >> > > > > > > > >>>>> offline.
> >> > > > > > > > >>>>> > I have updated the wiki to clarify it.
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> > >
> >> > > > > > > > >>>>> > > 2. A log directory stops working on a broker
> >> during
> >> > > > > runtime:
> >> > > > > > > > >>>>> > > 2.1 It seems the broker remembers the failed
> >> > directory
> >> > > > > after
> >> > > > > > > > >>>>> hitting an
> >> > > > > > > > >>>>> > > IOException and the failed directory won't be
> used
> >> > for
> >> > > > > > creating
> >> > > > > > > > new
> >> > > > > > > > >>>>> > > partitions until the broker is restarted? If so,
> >> > could
> >> > > > you
> >> > > > > > add
> >> > > > > > > > >>>>> that to
> >> > > > > > > > >>>>> > the
> >> > > > > > > > >>>>> > > wiki.
> >> > > > > > > > >>>>> > >
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> > Right, the broker assumes a log directory to be
> >> > > > > > > > >>>>> > good after it starts, and marks a log directory as
> >> > > > > > > > >>>>> > bad once there is an IOException when the broker
> >> > > > > > > > >>>>> > attempts to access it. New replicas will only be
> >> > > > > > > > >>>>> > created on a good log directory. I just added this
> >> > > > > > > > >>>>> > to the KIP.
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> > > 2.2 Could you be a bit more specific on how and
> >> > during
> >> > > > > which
> >> > > > > > > > >>>>> operation
> >> > > > > > > > >>>>> > the
> >> > > > > > > > >>>>> > > broker detects directory failure? Is it when the
> >> > broker
> >> > > > > hits
> >> > > > > > an
> >> > > > > > > > >>>>> > IOException
> >> > > > > > > > >>>>> > > during writes, or both reads and writes?  For
> >> > example,
> >> > > > > during
> >> > > > > > > > >>>>> broker
> >> > > > > > > > >>>>> > > startup, it only reads from each of the log
> >> > > directories,
> >> > > > if
> >> > > > > > it
> >> > > > > > > > >>>>> hits an
> >> > > > > > > > >>>>> > > IOException there, does the broker immediately
> >> mark
> >> > the
> >> > > > > > > directory
> >> > > > > > > > >>>>> as
> >> > > > > > > > >>>>> > > offline?
> >> > > > > > > > >>>>> > >
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> > Broker marks a log directory as bad once there is
> >> > > > > > > > >>>>> > an IOException when the broker attempts to access
> >> > > > > > > > >>>>> > the log directory. This includes both reads and
> >> > > > > > > > >>>>> > writes. These operations include log append, log
> >> > > > > > > > >>>>> > read, log cleaning, watermark checkpoint etc. If
> >> > > > > > > > >>>>> > the broker hits an IOException when it reads from
> >> > > > > > > > >>>>> > each of the log directories during startup, it
> >> > > > > > > > >>>>> > immediately marks the directory as offline.
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> > I just updated the KIP to clarify it.
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> > > 3. Partition reassignment: If we know a replica
> is
> >> > > > offline,
> >> > > > > > do
> >> > > > > > > we
> >> > > > > > > > >>>>> still
> >> > > > > > > > >>>>> > > want to send StopReplicaRequest to it?
> >> > > > > > > > >>>>> > >
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> > No, controller doesn't send StopReplicaRequest for
> >> an
> >> > > > offline
> >> > > > > > > > >>>>> replica.
> >> > > > > > > > >>>>> > Controller treats this scenario in the same way
> >> > > > > > > > >>>>> > that the existing Kafka implementation does when
> >> > > > > > > > >>>>> > the broker of this replica is offline.
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> > >
> >> > > > > > > > >>>>> > > 4. UpdateMetadataRequestPartitionState: For
> >> > > > > > offline_replicas,
> >> > > > > > > do
> >> > > > > > > > >>>>> they
> >> > > > > > > > >>>>> > only
> >> > > > > > > > >>>>> > > include offline replicas due to log directory
> >> > failures
> >> > > or
> >> > > > > do
> >> > > > > > > they
> >> > > > > > > > >>>>> also
> >> > > > > > > > >>>>> > > include offline replicas due to broker failure?
> >> > > > > > > > >>>>> > >
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> > UpdateMetadataRequestPartitionState's
> >> offline_replicas
> >> > > > > include
> >> > > > > > > > >>>>> offline
> >> > > > > > > > >>>>> > replicas due to both log directory failure and
> >> broker
> >> > > > > failure.
> >> > > > > > > This
> >> > > > > > > > >>>>> is to
> >> > > > > > > > >>>>> > make the semantics of this field easier to
> >> understand.
> >> > > > Broker
> >> > > > > > can
> >> > > > > > > > >>>>> > distinguish whether a replica is offline due to
> >> broker
> >> > > > > failure
> >> > > > > > or
> >> > > > > > > > >>>>> disk
> >> > > > > > > > >>>>> > failure by checking whether a broker is live in
> the
> >> > > > > > > > >>>>> UpdateMetadataRequest.
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> > >
> >> > > > > > > > >>>>> > > 5. Tools: Could we add some kind of support in
> the
> >> > tool
> >> > > > to
> >> > > > > > list
> >> > > > > > > > >>>>> offline
> >> > > > > > > > >>>>> > > directories?
> >> > > > > > > > >>>>> > >
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> > In KIP-112 we don't have tools to list offline
> >> > > directories
> >> > > > > > > because
> >> > > > > > > > >>>>> we have
> >> > > > > > > > >>>>> > intentionally avoided exposing log directory
> >> > information
> >> > > > > (e.g.
> >> > > > > > > log
> >> > > > > > > > >>>>> > directory path) to user or other brokers. I think
> we
> >> > can
> >> > > > add
> >> > > > > > this
> >> > > > > > > > >>>>> feature
> >> > > > > > > > >>>>> > in KIP-113, in which we will have
> >> DescribeDirsRequest
> >> > to
> >> > > > list
> >> > > > > > log
> >> > > > > > > > >>>>> directory
> >> > > > > > > > >>>>> > information (e.g. partition assignment, path,
> size)
> >> > > needed
> >> > > > > for
> >> > > > > > > > >>>>> rebalance.
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> Since we are introducing a new failure mode, if a
> >> replica
> >> > > > > becomes
> >> > > > > > > > >>>>> offline
> >> > > > > > > > >>>>> due to failure in log directories, the first thing
> an
> >> > admin
> >> > > > > wants
> >> > > > > > > to
> >> > > > > > > > >>>>> know
> >> > > > > > > > >>>>> is which log directories are offline from the
> broker's
> >> > > > > > perspective.
> >> > > > > > > > >>>>> So,
> >> > > > > > > > >>>>> including such a tool will be useful. Do you plan to
> >> do
> >> > > > KIP-112
> >> > > > > > and
> >> > > > > > > > >>>>> KIP-113
> >> > > > > > > > >>>>>  in the same release?
> >> > > > > > > > >>>>>
> >> > > > > > > > >>>>>
> >> > > > > > > > >>>> Yes, I agree that including such a tool is useful.
> >> > > > > > > > >>>> This is probably better to be added in KIP-113
> >> > > > > > > > >>>> because we need DescribeDirsRequest to get this
> >> > > > > > > > >>>> information. I will update KIP-113 to include this
> >> > > > > > > > >>>> tool.
> >> > > > > > > > >>>>
> >> > > > > > > > >>>> I plan to do KIP-112 and KIP-113 separately to make
> >> > > > > > > > >>>> each KIP and their patches easier to review. I don't
> >> > > > > > > > >>>> have any plan about which release to put these KIPs
> >> > > > > > > > >>>> in. My plan is to finish both of them ASAP. Is there
> >> > > > > > > > >>>> a particular timeline you prefer for the code of
> >> > > > > > > > >>>> these two KIPs to be checked in?
> >> > > > > > > > >>>>
> >> > > > > > > > >>>>
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> > >
> >> > > > > > > > >>>>> > > 6. Metrics: Could we add some metrics to show
> >> offline
> >> > > > > > > > directories?
> >> > > > > > > > >>>>> > >
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> > Sure. I think it makes sense to have each broker
> >> report
> >> > > its
> >> > > > > > > number
> >> > > > > > > > of
> >> > > > > > > > >>>>> > offline replicas and offline log directories. The
> >> > > previous
> >> > > > > > metric
> >> > > > > > > > >>>>> was put
> >> > > > > > > > >>>>> > in KIP-113. I just added both metrics in KIP-112.
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> > >
> >> > > > > > > > >>>>> > > 7. There are still references to
> >> kafka-log-dirs.sh.
> >> > Are
> >> > > > > they
> >> > > > > > > > valid?
> >> > > > > > > > >>>>> > >
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> > My bad. I just removed this from "Changes in
> >> > Operational
> >> > > > > > > > Procedures"
> >> > > > > > > > >>>>> and
> >> > > > > > > > >>>>> > "Test Plan" in the KIP.
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> > >
> >> > > > > > > > >>>>> > > 8. Do you think KIP-113 is ready for review? One
> >> > thing
> >> > > > that
> >> > > > > > > > KIP-113
> >> > > > > > > > >>>>> > > mentions during partition reassignment is to
> first
> >> > send
> >> > > > > > > > >>>>> > > LeaderAndIsrRequest, followed by
> >> > > ChangeReplicaDirRequest.
> >> > > > > It
> >> > > > > > > > seems
> >> > > > > > > > >>>>> it's
> >> > > > > > > > >>>>> > > better if the replicas are created in the right
> >> log
> >> > > > > directory
> >> > > > > > > in
> >> > > > > > > > >>>>> the
> >> > > > > > > > >>>>> > first
> >> > > > > > > > >>>>> > > place? The reason that I brought it up here is
> >> > because
> >> > > it
> >> > > > > may
> >> > > > > > > > >>>>> affect the
> >> > > > > > > > >>>>> > > protocol of LeaderAndIsrRequest.
> >> > > > > > > > >>>>> > >
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> > Yes, KIP-113 is ready for review. The advantage
> >> > > > > > > > >>>>> > of the current design is that we can keep
> >> > > > > > > > >>>>> > LeaderAndIsrRequest log-directory-agnostic. The
> >> > > > > > > > >>>>> > implementation would be much easier to read if all
> >> > > > > > > > >>>>> > log-related logic (e.g. various errors) is put in
> >> > > > > > > > >>>>> > ChangeReplicaDirRequest and the code path of
> >> > > > > > > > >>>>> > handling replica movement is separated from
> >> > > > > > > > >>>>> > leadership handling.
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> > In other words, I think Kafka may be easier to
> >> develop
> >> > in
> >> > > > the
> >> > > > > > > long
> >> > > > > > > > >>>>> term if
> >> > > > > > > > >>>>> > we separate these two requests.
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> > I agree that ideally we want to create replicas in
> >> the
> >> > > > right
> >> > > > > > log
> >> > > > > > > > >>>>> directory
> >> > > > > > > > >>>>> > in the first place. But I am not sure if there is
> >> any
> >> > > > > > performance
> >> > > > > > > > or
> >> > > > > > > > >>>>> > correctness concern with the existing way of
> moving
> >> it
> >> > > > after
> >> > > > > it
> >> > > > > > > is
> >> > > > > > > > >>>>> created.
> >> > > > > > > > >>>>> > Besides, does this decision affect the change
> >> proposed
> >> > in
> >> > > > > > > KIP-112?
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> >
> >> > > > > > > > >>>>> I am just wondering if you have considered including
> >> the
> >> > > log
> >> > > > > > > > directory
> >> > > > > > > > >>>>> for
> >> > > > > > > > >>>>> the replicas in the LeaderAndIsrRequest.
> >> > > > > > > > >>>>>
> >> > > > > > > > >>>>>
> >> > > > > > > > >>>> Yeah I have thought about this idea, but only
> briefly.
> >> I
> >> > > > > rejected
> >> > > > > > > this
> >> > > > > > > > >>>> idea because log directory is broker's local
> >> information
> >> > > and I
> >> > > > > > > prefer
> >> > > > > > > > not
> >> > > > > > > > >>>> to expose local config information to the cluster
> >> through
> >> > > > > > > > >>>> LeaderAndIsrRequest.
> >> > > > > > > > >>>>
> >> > > > > > > > >>>>
> >> > > > > > > > >>>>> 9. Could you describe when the offline replicas due
> to
> >> > log
> >> > > > > > > directory
> >> > > > > > > > >>>>> failure are removed from the replica fetch threads?
> >> > > > > > > > >>>>>
> >> > > > > > > > >>>>
> >> > > > > > > > >>>> Yes. If the offline replica was a leader, either a
> new
> >> > > leader
> >> > > > is
> >> > > > > > > > >>>> elected or all follower brokers will stop fetching
> for
> >> > this
> >> > > > > > > > partition. If
> >> > > > > > > > >>>> the offline replica is a follower, the broker will
> stop
> >> > > > fetching
> >> > > > > > for
> >> > > > > > > > this
> >> > > > > > > > >>>> replica immediately. A broker stops fetching data
> for a
> >> > > > replica
> >> > > > > by
> >> > > > > > > > removing
> >> > > > > > > > >>>> the replica from the replica fetch threads. I have
> >> updated
> >> > > the
> >> > > > > KIP
> >> > > > > > > to
> >> > > > > > > > >>>> clarify it.
> >> > > > > > > > >>>>
> >> > > > > > > > >>>>
> >> > > > > > > > >>>>>
> >> > > > > > > > >>>>> 10. The wiki mentioned changing the log directory
> to a
> >> > file
> >> > > > for
> >> > > > > > > > >>>>> simulating
> >> > > > > > > > >>>>> disk failure in system tests. Could we just change
> the
> >> > > > > permission
> >> > > > > > > of
> >> > > > > > > > >>>>> the
> >> > > > > > > > >>>>> log directory to 000 to simulate that?
> >> > > > > > > > >>>>>
> >> > > > > > > > >>>>
> >> > > > > > > > >>>>
> >> > > > > > > > >>>> Sure,
> >> > > > > > > > >>>>
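For the record, the permission-based simulation agreed on above could look like this in a test harness. This is a sketch only (the directory name is made up); note that a process running as root would bypass the permission check, so the test must run unprivileged.

```python
import os
import stat
import tempfile

# Simulate a failed log directory in a system test by dropping all
# permissions (chmod 000), instead of replacing the directory with a
# file, then "repair" the disk afterwards.
log_dir = tempfile.mkdtemp(prefix="kafka-logs-")

os.chmod(log_dir, 0o000)  # disk "fails": access raises EACCES for non-root
print(oct(stat.S_IMODE(os.stat(log_dir).st_mode)))  # 0o0

os.chmod(log_dir, 0o700)  # disk "fixed": broker can use the dir again
print(oct(stat.S_IMODE(os.stat(log_dir).st_mode)))  # 0o700
os.rmdir(log_dir)
```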
> >> > > > > > > > >>>>
> >> > > > > > > > >>>>>
> >> > > > > > > > >>>>> Thanks,
> >> > > > > > > > >>>>>
> >> > > > > > > > >>>>> Jun
> >> > > > > > > > >>>>>
> >> > > > > > > > >>>>>
> >> > > > > > > > >>>>> > > Jun
> >> > > > > > > > >>>>> > >
> >> > > > > > > > >>>>> > > On Fri, Feb 10, 2017 at 9:53 AM, Dong Lin <
> >> > > > > > lindong28@gmail.com
> >> > > > > > > >
> >> > > > > > > > >>>>> wrote:
> >> > > > > > > > >>>>> > >
> >> > > > > > > > >>>>> > > > Hi Jun,
> >> > > > > > > > >>>>> > > >
> >> > > > > > > > >>>>> > > > Can I replace zookeeper access with direct RPC
> >> for
> >> > > both
> >> > > > > ISR
> >> > > > > > > > >>>>> > notification
> >> > > > > > > > >>>>> > > > and disk failure notification in a future KIP,
> >> or
> >> > do
> >> > > > you
> >> > > > > > feel
> >> > > > > > > > we
> >> > > > > > > > >>>>> should
> >> > > > > > > > >>>>> > > do
> >> > > > > > > > >>>>> > > > it in this KIP?
> >> > > > > > > > >>>>> > > >
> >> > > > > > > > >>>>> > > > Hi Eno, Grant and everyone,
> >> > > > > > > > >>>>> > > >
> >> > > > > > > > >>>>> > > > Is there further improvement you would like to
> >> see
> >> > > with
> >> > > > > > this
> >> > > > > > > > KIP?
> >> > > > > > > > >>>>> > > >
> >> > > > > > > > >>>>> > > > Thank you all for the comments,
> >> > > > > > > > >>>>> > > >
> >> > > > > > > > >>>>> > > > Dong
> >> > > > > > > > >>>>> > > >
> >> > > > > > > > >>>>> > > >
> >> > > > > > > > >>>>> > > >
> >> > > > > > > > >>>>> > > > On Thu, Feb 9, 2017 at 4:45 PM, Dong Lin <
> >> > > > > > > lindong28@gmail.com>
> >> > > > > > > > >>>>> wrote:
> >> > > > > > > > >>>>> > > >
> >> > > > > > > > >>>>> > > > >
> >> > > > > > > > >>>>> > > > >
> >> > > > > > > > >>>>> > > > > On Thu, Feb 9, 2017 at 3:37 PM, Colin
> McCabe <
> >> > > > > > > > >>>>> cmccabe@apache.org>
> >> > > > > > > > >>>>> > > wrote:
> >> > > > > > > > >>>>> > > > >
> >> > > > > > > > >>>>> > > > >> On Thu, Feb 9, 2017, at 11:40, Dong Lin
> >> wrote:
> >> > > > > > > > >>>>> > > > >> > Thanks for all the comments Colin!
> >> > > > > > > > >>>>> > > > >> >
> >> > > > > > > > >>>>> > > > >> > To answer your questions:
> >> > > > > > > > >>>>> > > > >> > - Yes, a broker will shutdown if all its
> >> log
> >> > > > > > directories
> >> > > > > > > > >>>>> are bad.
> >> > > > > > > > >>>>> > > > >>
> >> > > > > > > > >>>>> > > > >> That makes sense.  Can you add this to the
> >> > > writeup?
> >> > > > > > > > >>>>> > > > >>
> >> > > > > > > > >>>>> > > > >
> >> > > > > > > > >>>>> > > > > Sure. This has already been added. You can
> >> find
> >> > it
> >> > > > here
> >> > > > > > > > >>>>> > > > > <https://cwiki.apache.org/confluence/pages/
> >> > > > > > > diffpagesbyversio
> >> > > > > > > > >>>>> n.action
> >> > > > > > > > >>>>> > ?
> >> > > > > > > > >>>>> > > > pageId=67638402&selectedPageVersions=9&
> >> > > > > > selectedPageVersions=
> >> > > > > > > > 10>
> >> > > > > > > > >>>>> > > > > .
> >> > > > > > > > >>>>> > > > >
> >> > > > > > > > >>>>> > > > >
> >> > > > > > > > >>>>> > > > >>
> >> > > > > > > > >>>>> > > > >> > - I updated the KIP to explicitly state
> >> that a
> >> > > log
> >> > > > > > > > >>>>> directory will
> >> > > > > > > > >>>>> > be
> >> > > > > > > > >>>>> > > > >> > assumed to be good until broker sees
> >> > IOException
> >> > > > > when
> >> > > > > > it
> >> > > > > > > > >>>>> tries to
> >> > > > > > > > >>>>> > > > access
> >> > > > > > > > >>>>> > > > >> > the log directory.
> >> > > > > > > > >>>>> > > > >>
> >> > > > > > > > >>>>> > > > >> Thanks.
> >> > > > > > > > >>>>> > > > >>
> >> > > > > > > > >>>>> > > > >> > - Controller doesn't explicitly know
> >> whether
> >> > > there
> >> > > > > is
> >> > > > > > > new
> >> > > > > > > > >>>>> log
> >> > > > > > > > >>>>> > > > directory
> >> > > > > > > > >>>>> > > > >> > or
> >> > > > > > > > >>>>> > > > >> > not. All controller knows is whether
> >> replicas
> >> > > are
> >> > > > > > online
> >> > > > > > > > or
> >> > > > > > > > >>>>> > offline
> >> > > > > > > > >>>>> > > > >> based
> >> > > > > > > > >>>>> > > > >> > on LeaderAndIsrResponse. According to the
> >> > > existing
> >> > > > > > Kafka
> >> > > > > > > > >>>>> > > > implementation,
> >> > > > > > > > >>>>> > > > >> > controller will always send
> >> > LeaderAndIsrRequest
> >> > > > to a
> >> > > > > > > > broker
> >> > > > > > > > >>>>> after
> >> > > > > > > > >>>>> > it
> >> > > > > > > > >>>>> > > > >> > bounces.
> >> > > > > > > > >>>>> > > > >>
> >> > > > > > > > >>>>> > > > >> I thought so.  It's good to clarify,
> >> though.  Do
> >> > > you
> >> > > > > > think
> >> > > > > > > > >>>>> it's
> >> > > > > > > > >>>>> > worth
> >> > > > > > > > >>>>> > > > >> adding a quick discussion of this on the
> >> wiki?
> >> > > > > > > > >>>>> > > > >>
> >> > > > > > > > >>>>> > > > >
> >> > > > > > > > >>>>> > > > > Personally I don't think it is needed. If
> >> broker
> >> > > > starts
> >> > > > > > > with
> >> > > > > > > > >>>>> no bad
> >> > > > > > > > >>>>> > log
> >> > > > > > > > >>>>> > > > > directory, everything should work as it is and
> we
> >> > > should
> >> > > > > not
> >> > > > > > > > need
> >> > > > > > > > >>>>> to
> >> > > > > > > > >>>>> > > clarify
> >> > > > > > > > >>>>> > > > > it. The KIP has already covered the scenario
> >> > when a
> >> > > > > > broker
> >> > > > > > > > >>>>> starts
> >> > > > > > > > >>>>> > with
> >> > > > > > > > >>>>> > > > bad
> >> > > > > > > > >>>>> > > > > log directory. Also, the KIP doesn't claim
> or
> >> > hint
> >> > > > that
> >> > > > > > we
> >> > > > > > > > >>>>> support
> >> > > > > > > > >>>>> > > > dynamic
> >> > > > > > > > >>>>> > > > > addition of new log directories. I think we
> >> are
> >> > > good.
> >> > > > > > > > >>>>> > > > >
> >> > > > > > > > >>>>> > > > >
> >> > > > > > > > >>>>> > > > >> best,
> >> > > > > > > > >>>>> > > > >> Colin
> >> > > > > > > > >>>>> > > > >>
> >> > > > > > > > >>>>> > > > >> >
> >> > > > > > > > >>>>> > > > >> > Please see this
> >> > > > > > > > >>>>> > > > >> > <https://cwiki.apache.org/conf
> >> > > > > > > > >>>>> luence/pages/diffpagesbyversio
> >> > > > > > > > >>>>> > > > >> n.action?pageId=67638402&
> >> > selectedPageVersions=9&
> >> > > > > > > > >>>>> > > > selectedPageVersions=10>
> >> > > > > > > > >>>>> > > > >> > for the change of the KIP.
> >> > > > > > > > >>>>> > > > >> >
> >> > > > > > > > >>>>> > > > >> > On Thu, Feb 9, 2017 at 11:04 AM, Colin
> >> McCabe
> >> > <
> >> > > > > > > > >>>>> cmccabe@apache.org
> >> > > > > > > > >>>>> > >
> >> > > > > > > > >>>>> > > > >> wrote:
> >> > > > > > > > >>>>> > > > >> >
> >> > > > > > > > >>>>> > > > >> > > On Thu, Feb 9, 2017, at 11:03, Colin
> >> McCabe
> >> > > > wrote:
> >> > > > > > > > >>>>> > > > >> > > > Thanks, Dong L.
> >> > > > > > > > >>>>> > > > >> > > >
> >> > > > > > > > >>>>> > > > >> > > > Do we plan on bringing down the
> broker
> >> > > process
> >> > > > > > when
> >> > > > > > > > all
> >> > > > > > > > >>>>> log
> >> > > > > > > > >>>>> > > > >> directories
> >> > > > > > > > >>>>> > > > >> > > > are offline?
> >> > > > > > > > >>>>> > > > >> > > >
> >> > > > > > > > >>>>> > > > >> > > > Can you explicitly state on the KIP
> >> that
> >> > the
> >> > > > log
> >> > > > > > > dirs
> >> > > > > > > > >>>>> are all
> >> > > > > > > > >>>>> > > > >> considered
> >> > > > > > > > >>>>> > > > >> > > > good after the broker process is
> >> bounced?
> >> > > It
> >> > > > > > seems
> >> > > > > > > > >>>>> like an
> >> > > > > > > > >>>>> > > > >> important
> >> > > > > > > > >>>>> > > > >> > > > thing to be clear about.  Also,
> perhaps
> >> > > > discuss
> >> > > > > > how
> >> > > > > > > > the
> >> > > > > > > > >>>>> > > controller
> >> > > > > > > > >>>>> > > > >> > > > becomes aware of the newly good log
> >> > > > directories
> >> > > > > > > after
> >> > > > > > > > a
> >> > > > > > > > >>>>> broker
> >> > > > > > > > >>>>> > > > >> bounce
> >> > > > > > > > >>>>> > > > >> > > > (and whether this triggers
> >> re-election).
> >> > > > > > > > >>>>> > > > >> > >
> >> > > > > > > > >>>>> > > > >> > > I meant to write, all the log dirs
> where
> >> the
> >> > > > > broker
> >> > > > > > > can
> >> > > > > > > > >>>>> still
> >> > > > > > > > >>>>> > read
> >> > > > > > > > >>>>> > > > the
> >> > > > > > > > >>>>> > > > >> > > index and some other files.  Clearly,
> log
> >> > dirs
> >> > > > > that
> >> > > > > > > are
> >> > > > > > > > >>>>> > completely
> >> > > > > > > > >>>>> > > > >> > > inaccessible will still be considered
> bad
> >> > > after
> >> > > > a
> >> > > > > > > broker
> >> > > > > > > > >>>>> process
> >> > > > > > > > >>>>> > > > >> bounce.
> >> > > > > > > > >>>>> > > > >> > >
> >> > > > > > > > >>>>> > > > >> > > best,
> >> > > > > > > > >>>>> > > > >> > > Colin
> >> > > > > > > > >>>>> > > > >> > >
> >> > > > > > > > >>>>> > > > >> > > >
> >> > > > > > > > >>>>> > > > >> > > > +1 (non-binding) aside from that
> >> > > > > > > > >>>>> > > > >> > > >
> >> > > > > > > > >>>>> > > > >> > > >
> >> > > > > > > > >>>>> > > > >> > > >
> >> > > > > > > > >>>>> > > > >> > > > On Wed, Feb 8, 2017, at 00:47, Dong
> Lin
> >> > > wrote:
> >> > > > > > > > >>>>> > > > >> > > > > Hi all,
> >> > > > > > > > >>>>> > > > >> > > > >
> >> > > > > > > > >>>>> > > > >> > > > > Thank you all for the helpful
> >> > suggestion.
> >> > > I
> >> > > > > have
> >> > > > > > > > >>>>> updated the
> >> > > > > > > > >>>>> > > KIP
> >> > > > > > > > >>>>> > > > >> to
> >> > > > > > > > >>>>> > > > >> > > > > address
> >> > > > > > > > >>>>> > > > >> > > > > the comments received so far. See
> >> here
> >> > > > > > > > >>>>> > > > >> > > > > <https://cwiki.apache.org/conf
> >> > > > > > > > >>>>> > luence/pages/diffpagesbyversio
> >> > > > > > > > >>>>> > > > >> n.action?
> >> > > > > > > > >>>>> > > > >> > > pageId=67638402&selectedPageVe
> >> > > > > > > > >>>>> rsions=8&selectedPageVersions=
> >> > > > > > > > >>>>> > 9>to
> >> > > > > > > > >>>>> > > > >> > > > > read the changes of the KIP. Here
> is
> >> a
> >> > > > summary
> >> > > > > > of
> >> > > > > > > > >>>>> change:
> >> > > > > > > > >>>>> > > > >> > > > >
> >> > > > > > > > >>>>> > > > >> > > > > - Updated the Proposed Change
> >> section to
> >> > > > > change
> >> > > > > > > the
> >> > > > > > > > >>>>> recovery
> >> > > > > > > > >>>>> > > > >> steps.
> >> > > > > > > > >>>>> > > > >> > > After
> >> > > > > > > > >>>>> > > > >> > > > > this change, broker will also
> create
> >> > > replica
> >> > > > > as
> >> > > > > > > long
> >> > > > > > > > >>>>> as all
> >> > > > > > > > >>>>> > > log
> >> > > > > > > > >>>>> > > > >> > > > > directories
> >> > > > > > > > >>>>> > > > >> > > > > are working.
> >> > > > > > > > >>>>> > > > >> > > > > - Removed kafka-log-dirs.sh from
> this
> >> > KIP
> >> > > > > since
> >> > > > > > > user
> >> > > > > > > > >>>>> no
> >> > > > > > > > >>>>> > longer
> >> > > > > > > > >>>>> > > > >> needs to
> >> > > > > > > > >>>>> > > > >> > > > > use
> >> > > > > > > > >>>>> > > > >> > > > > it for recovery from bad disks.
> >> > > > > > > > >>>>> > > > >> > > > > - Explained how the znode
> >> > > > > > controller_managed_state
> >> > > > > > > > is
> >> > > > > > > > >>>>> > managed
> >> > > > > > > > >>>>> > > in
> >> > > > > > > > >>>>> > > > >> the
> >> > > > > > > > >>>>> > > > >> > > > > Public
> >> > > > > > > > >>>>> > > > >> > > > > interface section.
> >> > > > > > > > >>>>> > > > >> > > > > - Explained what happens during
> >> > controller
> >> > > > > > > failover,
> >> > > > > > > > >>>>> > partition
> >> > > > > > > > >>>>> > > > >> > > > > reassignment
> >> > > > > > > > >>>>> > > > >> > > > > and topic deletion in the Proposed
> >> > Change
> >> > > > > > section.
> >> > > > > > > > >>>>> > > > >> > > > > - Updated Future Work section to
> >> include
> >> > > the
> >> > > > > > > > following
> >> > > > > > > > >>>>> > > potential
> >> > > > > > > > >>>>> > > > >> > > > > improvements
> >> > > > > > > > >>>>> > > > >> > > > >   - Let broker notify controller of
> >> ISR
> >> > > > change
> >> > > > > > and
> >> > > > > > > > >>>>> disk
> >> > > > > > > > >>>>> > state
> >> > > > > > > > >>>>> > > > >> change
> >> > > > > > > > >>>>> > > > >> > > via
> >> > > > > > > > >>>>> > > > >> > > > > RPC instead of using zookeeper
> >> > > > > > > > >>>>> > > > >> > > > >   - Handle various failure
> scenarios
> >> > (e.g.
> >> > > > > slow
> >> > > > > > > > disk)
> >> > > > > > > > >>>>> on a
> >> > > > > > > > >>>>> > > > >> case-by-case
> >> > > > > > > > >>>>> > > > >> > > > > basis. For example, we may want to
> >> > detect
> >> > > > slow
> >> > > > > > > disk
> >> > > > > > > > >>>>> and
> >> > > > > > > > >>>>> > > consider
> >> > > > > > > > >>>>> > > > >> it as
> >> > > > > > > > >>>>> > > > >> > > > > offline.
> >> > > > > > > > >>>>> > > > >> > > > >   - Allow admin to mark a directory
> >> as
> >> > bad
> >> > > > so
> >> > > > > > that
> >> > > > > > > > it
> >> > > > > > > > >>>>> will
> >> > > > > > > > >>>>> > not
> >> > > > > > > > >>>>> > > > be
> >> > > > > > > > >>>>> > > > >> used.
> >> > > > > > > > >>>>> > > > >> > > > >
> >> > > > > > > > >>>>> > > > >> > > > > Thanks,
> >> > > > > > > > >>>>> > > > >> > > > > Dong
> >> > > > > > > > >>>>> > > > >> > > > >
> >> > > > > > > > >>>>> > > > >> > > > >
> >> > > > > > > > >>>>> > > > >> > > > >
> >> > > > > > > > >>>>> > > > >> > > > > On Tue, Feb 7, 2017 at 5:23 PM,
> Dong
> >> > Lin <
> >> > > > > > > > >>>>> > lindong28@gmail.com
> >> > > > > > > > >>>>> > > >
> >> > > > > > > > >>>>> > > > >> wrote:
> >> > > > > > > > >>>>> > > > >> > > > >
> >> > > > > > > > >>>>> > > > >> > > > > > Hey Eno,
> >> > > > > > > > >>>>> > > > >> > > > > >
> >> > > > > > > > >>>>> > > > >> > > > > > Thanks much for the comment!
> >> > > > > > > > >>>>> > > > >> > > > > >
> >> > > > > > > > >>>>> > > > >> > > > > > I still think the complexity
> added
> >> to
> >> > > > Kafka
> >> > > > > is
> >> > > > > > > > >>>>> justified
> >> > > > > > > > >>>>> > by
> >> > > > > > > > >>>>> > > > its
> >> > > > > > > > >>>>> > > > >> > > benefit.
> >> > > > > > > > >>>>> > > > >> > > > > > Let me provide my reasons below.
> >> > > > > > > > >>>>> > > > >> > > > > >
> >> > > > > > > > >>>>> > > > >> > > > > > 1) The additional logic is easy
> to
> >> > > > > understand
> >> > > > > > > and
> >> > > > > > > > >>>>> thus its
> >> > > > > > > > >>>>> > > > >> complexity
> >> > > > > > > > >>>>> > > > >> > > > > > should be reasonable.
> >> > > > > > > > >>>>> > > > >> > > > > >
> >> > > > > > > > >>>>> > > > >> > > > > > On the broker side, it needs to
> >> catch
> >> > > > > > exception
> >> > > > > > > > when
> >> > > > > > > > >>>>> > accessing a
> >> > > > > > > > >>>>> > > > log
> >> > > > > > > > >>>>> > > > >> > > directory,
> >> > > > > > > > >>>>> > > > >> > > > > > mark log directory and all its
> >> > replicas
> >> > > as
> >> > > > > > > > offline,
> >> > > > > > > > >>>>> notify
> >> > > > > > > > >>>>> > > > >> > > controller by
> >> > > > > > > > >>>>> > > > >> > > > > > writing the zookeeper
> notification
> >> > path,
> >> > > > and
> >> > > > > > > > >>>>> specify error
> >> > > > > > > > >>>>> > > in
> >> > > > > > > > >>>>> > > > >> > > > > > LeaderAndIsrResponse. On the
> >> > controller
> >> > > > > side,
> >> > > > > > it
> >> > > > > > > > >>>>> will
> >> > > > > > > > >>>>> > > listen
> >> > > > > > > > >>>>> > > > >> to
> >> > > > > > > > >>>>> > > > >> > > > > > zookeeper for disk failure
> >> > notification,
> >> > > > > learn
> >> > > > > > > > about
> >> > > > > > > > >>>>> > offline
> >> > > > > > > > >>>>> > > > >> > > replicas in
> >> > > > > > > > >>>>> > > > >> > > > > > the LeaderAndIsrResponse, and
> take
> >> > > offline
> >> > > > > > > > replicas
> >> > > > > > > > >>>>> into
> >> > > > > > > > >>>>> > > > >> > > consideration when
> >> > > > > > > > >>>>> > > > >> > > > > > electing leaders. It also marks
> >> replica
> >> > > as
> >> > > > > > > created
> >> > > > > > > > in
> >> > > > > > > > >>>>> > > zookeeper
> >> > > > > > > > >>>>> > > > >> and
> >> > > > > > > > >>>>> > > > >> > > uses it
> >> > > > > > > > >>>>> > > > >> > > > > > to determine whether a replica is
> >> > > created.
> >> > > > > > > > >>>>> > > > >> > > > > >
> >> > > > > > > > >>>>> > > > >> > > > > > That is all the logic we need to
> >> add
> >> > in
> >> > > > > > Kafka. I
> >> > > > > > > > >>>>> > personally
> >> > > > > > > > >>>>> > > > feel
> >> > > > > > > > >>>>> > > > >> > > this is
> >> > > > > > > > >>>>> > > > >> > > > > > easy to reason about.
> >> > > > > > > > >>>>> > > > >> > > > > >
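[Editor's note] The broker/controller flow just described can be sketched in a few lines of illustrative Python. The class and znode names (`FakeZk`, `/log_dir_event_notification`, `KAFKA_STORAGE_ERROR`) are stand-ins chosen to mirror the description above, not Kafka code:

```python
class FakeZk:
    """Stand-in for a ZooKeeper client; only what this sketch needs."""
    def __init__(self):
        self.znodes = {}

    def create(self, path, data):
        self.znodes[path] = data


class Broker:
    def __init__(self, zk, log_dirs):
        self.zk = zk
        self.log_dirs = {d: "online" for d in log_dirs}
        self.replicas = {}  # replica -> log directory holding it

    def access_log_dir(self, log_dir, io_op):
        try:
            return io_op()
        except IOError:
            # 1) mark the log directory and every replica on it offline
            self.log_dirs[log_dir] = "offline"
            offline = [r for r, d in self.replicas.items() if d == log_dir]
            # 2) notify the controller via a ZooKeeper notification znode
            self.zk.create("/log_dir_event_notification", offline)
            return None

    def handle_leader_and_isr(self, partitions):
        # 3) report per-partition errors in the LeaderAndIsrResponse so
        #    the controller learns which replicas are offline
        return {p: "KAFKA_STORAGE_ERROR"
                   if self.log_dirs.get(self.replicas.get(p)) == "offline"
                   else "NONE"
                for p in partitions}
```

A failed I/O on one directory then marks that directory offline, publishes the affected replicas to the notification path, and surfaces a storage error for them in the next LeaderAndIsrResponse — the three broker-side steps listed above.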
> >> > > > > > > > >>>>> > > > >> > > > > > 2) The additional code is not
> much.
> >> > > > > > > > >>>>> > > > >> > > > > >
> >> > > > > > > > >>>>> > > > >> > > > > > I expect the code for KIP-112 to
> be
> >> > > around
> >> > > > > > 1100
> >> > > > > > > > >>>>> lines new
> >> > > > > > > > >>>>> > > > code.
> >> > > > > > > > >>>>> > > > >> > > Previously
> >> > > > > > > > >>>>> > > > >> > > > > > I have implemented a prototype
> of a
> >> > > > slightly
> >> > > > > > > > >>>>> different
> >> > > > > > > > >>>>> > > design
> >> > > > > > > > >>>>> > > > >> (see
> >> > > > > > > > >>>>> > > > >> > > here
> >> > > > > > > > >>>>> > > > >> > > > > > <https://docs.google.com/docum
> >> > > > > > > > >>>>> ent/d/1Izza0SBmZMVUBUt9s_
> >> > > > > > > > >>>>> > > > >> > > -Dqi3D8e0KGJQYW8xgEdRsgAI/edit>)
> >> > > > > > > > >>>>> > > > >> > > > > > and uploaded it to github (see
> here
> >> > > > > > > > >>>>> > > > >> > > > > > <https://github.com/lindong28/
> >> > > > > kafka/tree/JBOD
> >> > > > > > > >).
> >> > > > > > > > >>>>> The
> >> > > > > > > > >>>>> > patch
> >> > > > > > > > >>>>> > > > >> changed
> >> > > > > > > > >>>>> > > > >> > > 33
> >> > > > > > > > >>>>> > > > >> > > > > > files, added 1185 lines and
> deleted
> >> > 183
> >> > > > > lines.
> >> > > > > > > The
> >> > > > > > > > >>>>> size of
> >> > > > > > > > >>>>> > > > >> prototype
> >> > > > > > > > >>>>> > > > >> > > patch
> >> > > > > > > > >>>>> > > > >> > > > > > is actually smaller than patch of
> >> > > KIP-107
> >> > > > > (see
> >> > > > > > > > here
> >> > > > > > > > >>>>> > > > >> > > > > > <https://github.com/apache/
> >> > > > kafka/pull/2476
> >> > > > > >)
> >> > > > > > > > which
> >> > > > > > > > >>>>> is
> >> > > > > > > > >>>>> > > already
> >> > > > > > > > >>>>> > > > >> > > accepted.
> >> > > > > > > > >>>>> > > > >> > > > > > The KIP-107 patch changed 49
> files,
> >> > > added
> >> > > > > 1349
> >> > > > > > > > >>>>> lines and
> >> > > > > > > > >>>>> > > > >> deleted 141
> >> > > > > > > > >>>>> > > > >> > > lines.
> >> > > > > > > > >>>>> > > > >> > > > > >
> >> > > > > > > > >>>>> > > > >> > > > > > 3) Comparison with
> >> > > > > > > one-broker-per-multiple-volumes
> >> > > > > > > > >>>>> > > > >> > > > > >
> >> > > > > > > > >>>>> > > > >> > > > > > This KIP can improve the
> >> availability
> >> > of
> >> > > > > Kafka
> >> > > > > > > in
> >> > > > > > > > >>>>> this
> >> > > > > > > > >>>>> > case
> >> > > > > > > > >>>>> > > > such
> >> > > > > > > > >>>>> > > > >> > > that one
> >> > > > > > > > >>>>> > > > >> > > > > > failed volume doesn't bring down
> >> the
> >> > > > entire
> >> > > > > > > > broker.
> >> > > > > > > > >>>>> > > > >> > > > > >
> >> > > > > > > > >>>>> > > > >> > > > > > 4) Comparison with
> >> > one-broker-per-volume
> >> > > > > > > > >>>>> > > > >> > > > > >
> >> > > > > > > > >>>>> > > > >> > > > > > If each volume maps to multiple
> >> disks,
> >> > > > then
> >> > > > > we
> >> > > > > > > > >>>>> still have
> >> > > > > > > > >>>>> > > > >> similar
> >> > > > > > > > >>>>> > > > >> > > problem
> >> > > > > > > > >>>>> > > > >> > > > > > such that the broker will fail if
> >> any
> >> > > disk
> >> > > > > of
> >> > > > > > > the
> >> > > > > > > > >>>>> volume
> >> > > > > > > > >>>>> > > > failed.
> >> > > > > > > > >>>>> > > > >> > > > > >
> >> > > > > > > > >>>>> > > > >> > > > > > If each volume maps to one disk,
> it
> >> > > means
> >> > > > > that
> >> > > > > > > we
> >> > > > > > > > >>>>> need to
> >> > > > > > > > >>>>> > > > >> deploy 10
> >> > > > > > > > >>>>> > > > >> > > > > > brokers on a machine if the
> machine
> >> > has
> >> > > 10
> >> > > > > > > disks.
> >> > > > > > > > I
> >> > > > > > > > >>>>> will
> >> > > > > > > > >>>>> > > > >> explain the
> >> > > > > > > > >>>>> > > > >> > > > > > concern with this approach in
> >> order of
> >> > > > their
> >> > > > > > > > >>>>> importance.
> >> > > > > > > > >>>>> > > > >> > > > > >
> >> > > > > > > > >>>>> > > > >> > > > > > - It is weird if we were to tell
> >> kafka
> >> > > > users
> >> > > > > to
> >> > > > > > > > >>>>> deploy 50
> >> > > > > > > > >>>>> > > > >> brokers on a
> >> > > > > > > > >>>>> > > > >> > > > > > machine of 50 disks.
> >> > > > > > > > >>>>> > > > >> > > > > >
> >> > > > > > > > >>>>> > > > >> > > > > > - Either when user deploys Kafka
> >> on a
> >> > > > > > commercial
> >> > > > > > > > >>>>> cloud
> >> > > > > > > > >>>>> > > > platform
> >> > > > > > > > >>>>> > > > >> or
> >> > > > > > > > >>>>> > > > >> > > when
> >> > > > > > > > >>>>> > > > >> > > > > > user deploys their own cluster,
> the
> >> > size
> >> > > > or
> >> > > > > > > > largest
> >> > > > > > > > >>>>> disk
> >> > > > > > > > >>>>> > is
> >> > > > > > > > >>>>> > > > >> usually
> >> > > > > > > > >>>>> > > > >> > > > > > limited. There will be scenarios
> >> where
> >> > > > user
> >> > > > > > want
> >> > > > > > > > to
> >> > > > > > > > >>>>> > increase
> >> > > > > > > > >>>>> > > > >> broker
> >> > > > > > > > >>>>> > > > >> > > > > > capacity by having multiple disks
> >> per
> >> > > > > broker.
> >> > > > > > > This
> >> > > > > > > > >>>>> JBOD
> >> > > > > > > > >>>>> > KIP
> >> > > > > > > > >>>>> > > > >> makes it
> >> > > > > > > > >>>>> > > > >> > > > > > feasible without hurting
> >> availability
> >> > > due
> >> > > > to
> >> > > > > > > > single
> >> > > > > > > > >>>>> disk
> >> > > > > > > > >>>>> > > > >> failure.
> >> > > > > > > > >>>>> > > > >> > > > > >
> >> > > > > > > > >>>>> > > > >> > > > > > - Automatic load rebalance across
> >> > disks
> >> > > > will
> >> > > > > > be
> >> > > > > > > > >>>>> easier and
> >> > > > > > > > >>>>> > > > more
> >> > > > > > > > >>>>> > > > >> > > flexible
> >> > > > > > > > >>>>> > > > >> > > > > > if one broker has multiple disks.
> >> This
> >> > > can
> >> > > > > be
> >> > > > > > > > >>>>> future work.
> >> > > > > > > > >>>>> > > > >> > > > > >
> >> > > > > > > > >>>>> > > > >> > > > > > - There is performance concern
> when
> >> > you
> >> > > > > deploy
> >> > > > > > > 10
> >> > > > > > > > >>>>> broker
> >> > > > > > > > >>>>> > > vs. 1
> >> > > > > > > > >>>>> > > > >> > > broker on
> >> > > > > > > > >>>>> > > > >> > > > > > one machine. The metadata of the
> >> cluster,
> >> > > > > > including
> >> > > > > > > > >>>>> > > FetchRequest,
> >> > > > > > > > >>>>> > > > >> > > > > > ProduceResponse, MetadataRequest
> >> and
> >> > so
> >> > > on
> >> > > > > > will
> >> > > > > > > > all
> >> > > > > > > > >>>>> be 10X
> >> > > > > > > > >>>>> > > > >> more. The
> >> > > > > > > > >>>>> > > > >> > > > > > packet-per-second will be 10X
> >> higher
> >> > > which
> >> > > > > may
> >> > > > > > > > limit
> >> > > > > > > > >>>>> > > > >> performance if
> >> > > > > > > > >>>>> > > > >> > > pps is
> >> > > > > > > > >>>>> > > > >> > > > > > the performance bottleneck. The
> >> number
> >> > > of
> >> > > > > > socket
> >> > > > > > > > on
> >> > > > > > > > >>>>> the
> >> > > > > > > > >>>>> > > > machine
> >> > > > > > > > >>>>> > > > >> is
> >> > > > > > > > >>>>> > > > >> > > 10X
> >> > > > > > > > >>>>> > > > >> > > > > > higher. And the number of
> >> replication
> >> > > > thread
> >> > > > > > > will
> >> > > > > > > > >>>>> be 100X
> >> > > > > > > > >>>>> > > > more.
> >> > > > > > > > >>>>> > > > >> The
> >> > > > > > > > >>>>> > > > >> > > impact
> >> > > > > > > > >>>>> > > > >> > > > > > will be more significant with
> >> > increasing
> >> > > > > > number
> >> > > > > > > of
> >> > > > > > > > >>>>> disks
> >> > > > > > > > >>>>> > per
> >> > > > > > > > >>>>> > > > >> > > machine. Thus
> >> > > > > > > > >>>>> > > > >> > > > > > it will limit Kafka's scalability
> >> in
> >> > the
> >> > > > > long
> >> > > > > > > > term.
> >> > > > > > > > >>>>> > > > >> > > > > >
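[Editor's note] A rough model of the scaling concern above, under illustrative assumptions (not from the KIP): metadata size, socket count and packets-per-second scale linearly with the number of brokers, and replication fetch links scale roughly with its square, since every broker may fetch from every other:

```python
# Rough model: 1 broker per machine vs 10 brokers per machine.
machines = 100

def cluster_overhead(brokers_per_machine):
    brokers = machines * brokers_per_machine
    return {
        "brokers": brokers,           # drives metadata size, pps, sockets
        "fetch_links": brokers ** 2,  # ~one fetch link per ordered broker pair
    }

one = cluster_overhead(1)
ten = cluster_overhead(10)
metadata_factor = ten["brokers"] // one["brokers"]         # 10X metadata/sockets
threads_factor = ten["fetch_links"] // one["fetch_links"]  # 100X fetch links
```

Under these assumptions the 10X and 100X figures quoted above fall out directly, and both grow with the number of disks (and hence brokers) per machine.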
> >> > > > > > > > >>>>> > > > >> > > > > > Thanks,
> >> > > > > > > > >>>>> > > > >> > > > > > Dong
> >> > > > > > > > >>>>> > > > >> > > > > >
> >> > > > > > > > >>>>> > > > >> > > > > >
> >> > > > > > > > >>>>> > > > >> > > > > > On Tue, Feb 7, 2017 at 1:51 AM,
> Eno
> >> > > > > Thereska <
> >> > > > > > > > >>>>> > > > >> eno.thereska@gmail.com
> >> > > > > > > > >>>>> > > > >> > > >
> >> > > > > > > > >>>>> > > > >> > > > > > wrote:
> >> > > > > > > > >>>>> > > > >> > > > > >
> >> > > > > > > > >>>>> > > > >> > > > > >> Hi Dong,
> >> > > > > > > > >>>>> > > > >> > > > > >>
> >> > > > > > > > >>>>> > > > >> > > > > >> To simplify the discussion
> today,
> >> on
> >> > my
> >> > > > > part
> >> > > > > > > I'll
> >> > > > > > > > >>>>> zoom
> >> > > > > > > > >>>>> > into
> >> > > > > > > > >>>>> > > > one
> >> > > > > > > > >>>>> > > > >> > > thing
> >> > > > > > > > >>>>> > > > >> > > > > >> only:
> >> > > > > > > > >>>>> > > > >> > > > > >>
> >> > > > > > > > >>>>> > > > >> > > > > >> - I'll discuss the options
> called
> >> > > below :
> >> > > > > > > > >>>>> > > > >> "one-broker-per-disk" or
> >> > > > > > > > >>>>> > > > >> > > > > >> "one-broker-per-few-disks".
> >> > > > > > > > >>>>> > > > >> > > > > >>
> >> > > > > > > > >>>>> > > > >> > > > > >> - I completely buy the JBOD vs
> >> RAID
> >> > > > > arguments
> >> > > > > > > so
> >> > > > > > > > >>>>> there is
> >> > > > > > > > >>>>> > > no
> >> > > > > > > > >>>>> > > > >> need to
> >> > > > > > > > >>>>> > > > >> > > > > >> discuss that part for me. I buy
> it
> >> > that
> >> > > > > JBODs
> >> > > > > > > are
> >> > > > > > > > >>>>> good.
> >> > > > > > > > >>>>> > > > >> > > > > >>
> >> > > > > > > > >>>>> > > > >> > > > > >> I find the terminology can be
> >> > improved
> >> > > a
> >> > > > > bit.
> >> > > > > > > > >>>>> Ideally
> >> > > > > > > > >>>>> > we'd
> >> > > > > > > > >>>>> > > be
> >> > > > > > > > >>>>> > > > >> > > talking
> >> > > > > > > > >>>>> > > > >> > > > > >> about volumes, not disks. Just
> to
> >> > make
> >> > > it
> >> > > > > > clear
> >> > > > > > > > >>>>> that
> >> > > > > > > > >>>>> > Kafka
> >> > > > > > > > >>>>> > > > >> > > understand
> >> > > > > > > > >>>>> > > > >> > > > > >> volumes/directories, not
> >> individual
> >> > raw
> >> > > > > > disks.
> >> > > > > > > So
> >> > > > > > > > >>>>> by
> >> > > > > > > > >>>>> > > > >> > > > > >> "one-broker-per-few-disks" what
> I
> >> > mean
> >> > > is
> >> > > > > > that
> >> > > > > > > > the
> >> > > > > > > > >>>>> admin
> >> > > > > > > > >>>>> > > can
> >> > > > > > > > >>>>> > > > >> pool a
> >> > > > > > > > >>>>> > > > >> > > few
> >> > > > > > > > >>>>> > > > >> > > > > >> disks together to create a
> >> > > > volume/directory
> >> > > > > > and
> >> > > > > > > > >>>>> give that
> >> > > > > > > > >>>>> > > to
> >> > > > > > > > >>>>> > > > >> Kafka.
> >> > > > > > > > >>>>> > > > >> > > > > >>
> >> > > > > > > > >>>>> > > > >> > > > > >>
> >> > > > > > > > >>>>> > > > >> > > > > >> The kernel of my question will
> be
> >> > that
> >> > > > the
> >> > > > > > > admin
> >> > > > > > > > >>>>> already
> >> > > > > > > > >>>>> > > has
> >> > > > > > > > >>>>> > > > >> tools
> >> > > > > > > > >>>>> > > > >> > > to 1)
> >> > > > > > > > >>>>> > > > >> > > > > >> create volumes/directories from
> a
> >> > JBOD
> >> > > > and
> >> > > > > 2)
> >> > > > > > > > >>>>> start a
> >> > > > > > > > >>>>> > > broker
> >> > > > > > > > >>>>> > > > >> on a
> >> > > > > > > > >>>>> > > > >> > > desired
> >> > > > > > > > >>>>> > > > >> > > > > >> machine and 3) assign a broker
> >> > > resources
> >> > > > > > like a
> >> > > > > > > > >>>>> > directory.
> >> > > > > > > > >>>>> > > I
> >> > > > > > > > >>>>> > > > >> claim
> >> > > > > > > > >>>>> > > > >> > > that
> >> > > > > > > > >>>>> > > > >> > > > > >> those tools are sufficient to
> >> > optimise
> >> > > > > > resource
> >> > > > > > > > >>>>> > allocation.
> >> > > > > > > > >>>>> > > > I
> >> > > > > > > > >>>>> > > > >> > > understand
> >> > > > > > > > >>>>> > > > >> > > > > >> that a broker could manage point
> >> 3)
> >> > > > itself,
> >> > > > > > i.e.
> >> > > > > > > > >>>>> juggle the
> >> > > > > > > > >>>>> > > > >> > > directories. My
> >> > > > > > > > >>>>> > > > >> > > > > >> question is whether the
> complexity
> >> > > added
> >> > > > to
> >> > > > > > > Kafka
> >> > > > > > > > >>>>> is
> >> > > > > > > > >>>>> > > > justified.
> >> > > > > > > > >>>>> > > > >> > > > > >> Operationally it seems to me an
> >> admin
> >> > > > will
> >> > > > > > > still
> >> > > > > > > > >>>>> have to
> >> > > > > > > > >>>>> > do
> >> > > > > > > > >>>>> > > > >> all the
> >> > > > > > > > >>>>> > > > >> > > three
> >> > > > > > > > >>>>> > > > >> > > > > >> items above.
> >> > > > > > > > >>>>> > > > >> > > > > >>
> >> > > > > > > > >>>>> > > > >> > > > > >> Looking forward to the
> discussion
> >> > > > > > > > >>>>> > > > >> > > > > >> Thanks
> >> > > > > > > > >>>>> > > > >> > > > > >> Eno
> >> > > > > > > > >>>>> > > > >> > > > > >>
> >> > > > > > > > >>>>> > > > >> > > > > >>
> >> > > > > > > > >>>>> > > > >> > > > > >> > On 1 Feb 2017, at 17:21, Dong
> >> Lin <
> >> > > > > > > > >>>>> lindong28@gmail.com
> >> > > > > > > > >>>>> > >
> >> > > > > > > > >>>>> > > > >> wrote:
> >> > > > > > > > >>>>> > > > >> > > > > >> >
> >> > > > > > > > >>>>> > > > >> > > > > >> > Hey Eno,
> >> > > > > > > > >>>>> > > > >> > > > > >> >
> >> > > > > > > > >>>>> > > > >> > > > > >> > Thanks much for the review.
> >> > > > > > > > >>>>> > > > >> > > > > >> >
> >> > > > > > > > >>>>> > > > >> > > > > >> > I think your suggestion is to
> >> split
> >> > > > disks
> >> > > > > > of
> >> > > > > > > a
> >> > > > > > > > >>>>> machine
> >> > > > > > > > >>>>> > > into
> >> > > > > > > > >>>>> > > > >> > > multiple
> >> > > > > > > > >>>>> > > > >> > > > > >> disk
> >> > > > > > > > >>>>> > > > >> > > > > >> > sets and run one broker per
> disk
> >> > set.
> >> > > > > Yeah
> >> > > > > > > this
> >> > > > > > > > >>>>> is
> >> > > > > > > > >>>>> > > similar
> >> > > > > > > > >>>>> > > > to
> >> > > > > > > > >>>>> > > > >> > > Colin's
> >> > > > > > > > >>>>> > > > >> > > > > >> > suggestion of
> >> one-broker-per-disk,
> >> > > > which
> >> > > > > we
> >> > > > > > > > have
> >> > > > > > > > >>>>> > > evaluated
> >> > > > > > > > >>>>> > > > at
> >> > > > > > > > >>>>> > > > >> > > LinkedIn
> >> > > > > > > > >>>>> > > > >> > > > > >> and
> >> > > > > > > > >>>>> > > > >> > > > > >> > considered it to be a good
> short
> >> > term
> >> > > > > > > approach.
> >> > > > > > > > >>>>> > > > >> > > > > >> >
> >> > > > > > > > >>>>> > > > >> > > > > >> > As of now I don't think any of
> >> > these
> >> > > > > > approach
> >> > > > > > > > is
> >> > > > > > > > >>>>> a
> >> > > > > > > > >>>>> > better
> >> > > > > > > > >>>>> > > > >> > > alternative in
> >> > > > > > > > >>>>> > > > >> > > > > >> > the long term. I will
> summarize
> >> > these
> >> > > > > > here. I
> >> > > > > > > > >>>>> have put
> >> > > > > > > > >>>>> > > > these
> >> > > > > > > > >>>>> > > > >> > > reasons in
> >> > > > > > > > >>>>> > > > >> > > > > >> the
> >> > > > > > > > >>>>> > > > >> > > > > >> > KIP's motivation section and
> >> > rejected
> >> > > > > > > > alternative
> >> > > > > > > > >>>>> > > section.
> >> > > > > > > > >>>>> > > > I
> >> > > > > > > > >>>>> > > > >> am
> >> > > > > > > > >>>>> > > > >> > > happy to
> >> > > > > > > > >>>>> > > > >> > > > > >> > discuss more and I would
> >> certainly
> >> > > like
> >> > > > > to
> >> > > > > > > use
> >> > > > > > > > an
> >> > > > > > > > >>>>> > > > alternative
> >> > > > > > > > >>>>> > > > >> > > solution
> >> > > > > > > > >>>>> > > > >> > > > > >> that
> >> > > > > > > > >>>>> > > > >> > > > > >> > is easier to do with better
> >> > > > performance.
> >> > > > > > > > >>>>> > > > >> > > > > >> >
> >> > > > > > > > >>>>> > > > >> > > > > >> > - JBOD vs. RAID-10: if we
> switch
> >> > from
> >> > > > > > RAID-10
> >> > > > > > > > >>>>> with
> >> > > > > > > > >>>>> > > > >> > > > > >> replication-factor=2 to
> >> > > > > > > > >>>>> > > > >> > > > > >> > JBOD with replication-factor=3,
> >> we
> >> > get
> >> > > > 25%
> >> > > > > > > > >>>>> reduction in
> >> > > > > > > > >>>>> > > disk
> >> > > > > > > > >>>>> > > > >> usage
> >> > > > > > > > >>>>> > > > >> > > and
> >> > > > > > > > >>>>> > > > >> > > > > >> > doubles the tolerance of
> broker
> >> > > failure
> >> > > > > > > before
> >> > > > > > > > >>>>> data
> >> > > > > > > > >>>>> > > > >> > > unavailability from
> >> > > > > > > > >>>>> > > > >> > > > > >> 1
> >> > > > > > > > >>>>> > > > >> > > > > >> > to 2. This is pretty huge gain
> >> for
> >> > > any
> >> > > > > > > company
> >> > > > > > > > >>>>> that
> >> > > > > > > > >>>>> > uses
> >> > > > > > > > >>>>> > > > >> Kafka at
> >> > > > > > > > >>>>> > > > >> > > large
> >> > > > > > > > >>>>> > > > >> > > > > >> > scale.
> >> > > > > > > > >>>>> > > > >> > > > > >> >
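The 25% figure follows from simple copy counting; here is a minimal sketch, assuming each replica (and each RAID-10 mirror) stores one full copy of the data:

```python
# RAID-10 mirrors every write, so replication-factor=2 on RAID-10
# stores each message 2 (replicas) * 2 (mirror copies) = 4 times.
copies_raid10 = 2 * 2

# JBOD has no mirroring, so replication-factor=3 stores 3 copies.
copies_jbod = 3

# Disk usage drops by 1 - 3/4 = 25%.
reduction = 1 - copies_jbod / copies_raid10
print(f"{reduction:.0%}")  # 25%
```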
> >> > > > > > > > >>>>> > > > >> > > > > >> > - JBOD vs. one-broker-per-disk: The benefit of one-broker-per-disk is
> >> > > > > > > > >>>>> > > > >> > > > > >> > that no major code change is needed in Kafka. Among the disadvantages
> >> > > > > > > > >>>>> > > > >> > > > > >> > of one-broker-per-disk summarized in the KIP and the previous email
> >> > > > > > > > >>>>> > > > >> > > > > >> > with Colin, the biggest ones are the 15% throughput loss compared to
> >> > > > > > > > >>>>> > > > >> > > > > >> > JBOD and less flexibility to balance across disks. Further, it
> >> > > > > > > > >>>>> > > > >> > > > > >> > probably requires changes to internal deployment tools at various
> >> > > > > > > > >>>>> > > > >> > > > > >> > companies to deal with a one-broker-per-disk setup.
> >> > > > > > > > >>>>> > > > >> > > > > >> >
> >> > > > > > > > >>>>> > > > >> > > > > >> > - JBOD vs. RAID-0: This is the setup used at Microsoft. The problem is
> >> > > > > > > > >>>>> > > > >> > > > > >> > that a broker becomes unavailable if any disk fails. Suppose
> >> > > > > > > > >>>>> > > > >> > > > > >> > replication-factor=2 and there are 10 disks per machine. Then the
> >> > > > > > > > >>>>> > > > >> > > > > >> > probability that any message becomes unavailable due to disk failure
> >> > > > > > > > >>>>> > > > >> > > > > >> > is 100X higher with RAID-0 than with JBOD.
> >> > > > > > > > >>>>> > > > >> > > > > >> >
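The 100X claim can be sketched with a small independence model; the per-disk failure probability and the independence assumption below are illustrative, not from the thread:

```python
# Assume each disk fails independently with probability p in some window.
p = 0.01  # illustrative per-disk failure probability
disks = 10

# RAID-0: the whole broker is lost if ANY of its 10 disks fails, so with
# replication-factor=2 a message is unavailable when both brokers lose a disk.
p_broker_down = 1 - (1 - p) ** disks        # ~10p for small p
p_unavailable_raid0 = p_broker_down ** 2    # ~(10p)^2 = 100p^2

# JBOD: only the two disks that actually hold the replicas must fail.
p_unavailable_jbod = p ** 2

# The ratio approaches 100 as p gets small.
print(round(p_unavailable_raid0 / p_unavailable_jbod))
```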
> >> > > > > > > > >>>>> > > > >> > > > > >> > - JBOD vs. one-broker-per-few-disks: one-broker-per-few-disks is
> >> > > > > > > > >>>>> > > > >> > > > > >> > somewhere between one-broker-per-disk and RAID-0, so it carries an
> >> > > > > > > > >>>>> > > > >> > > > > >> > average of the disadvantages of those two approaches.
> >> > > > > > > > >>>>> > > > >> > > > > >> >
> >> > > > > > > > >>>>> > > > >> > > > > >> > To answer your question: I think it is reasonable to manage disks in
> >> > > > > > > > >>>>> > > > >> > > > > >> > Kafka. By "managing disks" we mean the management of the assignment of
> >> > > > > > > > >>>>> > > > >> > > > > >> > replicas across disks. Here are my reasons in more detail:
> >> > > > > > > > >>>>> > > > >> > > > > >> >
> >> > > > > > > > >>>>> > > > >> > > > > >> > - I don't think this KIP is a big step change. By allowing users to
> >> > > > > > > > >>>>> > > > >> > > > > >> > configure Kafka to run with multiple log directories or disks, as it
> >> > > > > > > > >>>>> > > > >> > > > > >> > already does, it is implicit that Kafka manages disks. It is just not
> >> > > > > > > > >>>>> > > > >> > > > > >> > a complete feature. Microsoft and probably other companies are using
> >> > > > > > > > >>>>> > > > >> > > > > >> > this feature with the undesirable effect that a broker will fail if
> >> > > > > > > > >>>>> > > > >> > > > > >> > any disk fails. It is good to complete this feature.
> >> > > > > > > > >>>>> > > > >> > > > > >> >
> >> > > > > > > > >>>>> > > > >> > > > > >> > - I think it is reasonable to manage disks in Kafka. One of the most
> >> > > > > > > > >>>>> > > > >> > > > > >> > important things Kafka does is to determine the replica assignment
> >> > > > > > > > >>>>> > > > >> > > > > >> > across brokers and make sure enough copies of a given partition are
> >> > > > > > > > >>>>> > > > >> > > > > >> > available. I would argue that determining the replica assignment
> >> > > > > > > > >>>>> > > > >> > > > > >> > across disks is conceptually not much different.
> >> > > > > > > > >>>>> > > > >> > > > > >> >
> >> > > > > > > > >>>>> > > > >> > > > > >> > - I would agree that this KIP improves the performance of Kafka at the
> >> > > > > > > > >>>>> > > > >> > > > > >> > cost of more complexity inside Kafka, by switching from RAID-10 to
> >> > > > > > > > >>>>> > > > >> > > > > >> > JBOD. I would argue that this is the right direction. If we could gain
> >> > > > > > > > >>>>> > > > >> > > > > >> > 20%+ performance by managing NICs in Kafka as compared to the existing
> >> > > > > > > > >>>>> > > > >> > > > > >> > approach and other alternatives, I would say we should just do it.
> >> > > > > > > > >>>>> > > > >> > > > > >> > Such a gain in performance, or equivalently reduction in cost, can
> >> > > > > > > > >>>>> > > > >> > > > > >> > save millions of dollars per year for any company running Kafka at
> >> > > > > > > > >>>>> > > > >> > > > > >> > large scale.
> >> > > > > > > > >>>>> > > > >> > > > > >> >
> >> > > > > > > > >>>>> > > > >> > > > > >> > Thanks,
> >> > > > > > > > >>>>> > > > >> > > > > >> > Dong
> >> > > > > > > > >>>>> > > > >> > > > > >> >
> >> > > > > > > > >>>>> > > > >> > > > > >> >
> >> > > > > > > > >>>>> > > > >> > > > > >> > On Wed, Feb 1, 2017 at 5:41 AM, Eno Thereska <eno.thereska@gmail.com> wrote:
> >> > > > > > > > >>>>> > > > >> > > > > >> >
> >> > > > > > > > >>>>> > > > >> > > > > >> >> I'm coming somewhat late to the discussion; apologies for that.
> >> > > > > > > > >>>>> > > > >> > > > > >> >>
> >> > > > > > > > >>>>> > > > >> > > > > >> >> I'm worried about this proposal. It's moving Kafka to a world where it
> >> > > > > > > > >>>>> > > > >> > > > > >> >> manages disks. So in a sense the scope of the KIP is limited, but the
> >> > > > > > > > >>>>> > > > >> > > > > >> >> direction it sets for Kafka is quite a big step change. Fundamentally
> >> > > > > > > > >>>>> > > > >> > > > > >> >> this is about balancing resources for a Kafka broker. This can be done
> >> > > > > > > > >>>>> > > > >> > > > > >> >> by a tool, rather than by changing Kafka. E.g., the tool would take a
> >> > > > > > > > >>>>> > > > >> > > > > >> >> bunch of disks together, create a volume over them and export that to
> >> > > > > > > > >>>>> > > > >> > > > > >> >> a Kafka broker (in addition to setting the memory limits for that
> >> > > > > > > > >>>>> > > > >> > > > > >> >> broker or limiting other resources). A different bunch of disks can
> >> > > > > > > > >>>>> > > > >> > > > > >> >> then make up a second volume, and be used by another Kafka broker.
> >> > > > > > > > >>>>> > > > >> > > > > >> >> This is aligned with what Colin is saying (as I understand it).
> >> > > > > > > > >>>>> > > > >> > > > > >> >>
> >> > > > > > > > >>>>> > > > >> > > > > >> >> Disks are not the only resource on a machine; there are several
> >> > > > > > > > >>>>> > > > >> > > > > >> >> instances where multiple NICs are used, for example. Do we want
> >> > > > > > > > >>>>> > > > >> > > > > >> >> fine-grained management of all these resources? I'd argue that opens
> >> > > > > > > > >>>>> > > > >> > > > > >> >> the system up to a lot of complexity.
> >> > > > > > > > >>>>> > > > >> > > > > >> >>
> >> > > > > > > > >>>>> > > > >> > > > > >> >> Thanks
> >> > > > > > > > >>>>> > > > >> > > > > >> >> Eno
> >> > > > > > > > >>>>> > > > >> > > > > >> >>
> >> > > > > > > > >>>>> > > > >> > > > > >> >>
> >> > > > > > > > >>>>> > > > >> > > > > >> >>> On 1 Feb 2017, at 01:53, Dong Lin <lindong28@gmail.com> wrote:
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>
> >> > > > > > > > >>>>> > > > >> > > > > >> >>> Hi all,
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>
> >> > > > > > > > >>>>> > > > >> > > > > >> >>> I am going to initiate the vote if there is no further concern with
> >> > > > > > > > >>>>> > > > >> > > > > >> >>> the KIP.
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>
> >> > > > > > > > >>>>> > > > >> > > > > >> >>> Thanks,
> >> > > > > > > > >>>>> > > > >> > > > > >> >>> Dong
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>
> >> > > > > > > > >>>>> > > > >> > > > > >> >>> On Fri, Jan 27, 2017 at 8:08 PM, radai <radai.rosenblatt@gmail.com> wrote:
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>> a few extra points:
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>> 1. broker per disk might also incur more client <--> broker sockets:
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>> suppose every producer / consumer "talks" to >1 partition; there's a
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>> very good chance that partitions that were co-located on a single
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>> 10-disk broker would now be split between several single-disk broker
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>> processes on the same machine. hard to put a multiplier on this, but
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>> likely >x1. sockets are a limited resource at the OS level and incur
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>> some memory cost (kernel buffers)
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>> 2. there's a memory overhead to spinning up a JVM (compiled code and
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>> byte code objects etc). if we assume this overhead is ~300 MB (order
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>> of magnitude, specifics vary) then spinning up 10 JVMs would lose you
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>> 3 GB of RAM. not a ton, but non-negligible.
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>
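radai's point 2 is straightforward arithmetic; as a sketch using his assumed ~300 MB per-JVM overhead figure:

```python
# Assumed per-process JVM overhead (compiled code, class metadata, etc.);
# the 300 MB figure is radai's order-of-magnitude estimate, not a measurement.
jvm_overhead_mb = 300
brokers_per_machine = 10

extra_ram_gb = jvm_overhead_mb * brokers_per_machine / 1000
print(f"{extra_ram_gb:.0f} GB")  # 3 GB of RAM lost to duplicate JVMs
```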
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>> 3. there would also be some overhead downstream of kafka in any
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>> management / monitoring / log aggregation system. likely less than
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>> x10 though.
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>> 4. (related to above) - added complexity of administration with more
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>> running instances.
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>> is anyone running kafka with anywhere near 100GB heaps? i thought the
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>> point was to rely on the kernel page cache to do the disk buffering ....
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>> On Thu, Jan 26, 2017 at 11:00 AM, Dong Lin <lindong28@gmail.com> wrote:
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>> Hey Colin,
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>> Thanks much for the comment. Please see my comments inline.
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>> On Thu, Jan 26, 2017 at 10:15 AM, Colin McCabe <cmccabe@apache.org> wrote:
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>> On Wed, Jan 25, 2017, at 13:50, Dong Lin wrote:
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> Hey Colin,
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>>
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> Good point! Yeah, we have actually considered and tested this
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> solution, which we call one-broker-per-disk. It would work and
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> should require no major change in Kafka as compared to this JBOD
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> KIP, so it would be a good short-term solution.
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>>
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> But it has a few drawbacks which make it less desirable in the long
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> term. Assume we have 10 disks on a machine. Here are the problems:
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>> Hi Dong,
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>> Thanks for the thoughtful reply.
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>>
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> 1) Our stress test result shows that one-broker-per-disk has 15%
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> lower throughput.
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>>
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> 2) The controller would need to send 10X as many LeaderAndIsrRequest,
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> MetadataUpdateRequest and StopReplicaRequest. This increases the
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> burden on the controller, which can be the performance bottleneck.
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>> Maybe I'm misunderstanding something, but there would not be 10x as
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>> many StopReplicaRequest RPCs, would there? The other requests would
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>> increase 10x, but from a pretty low base, right? We are not
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>> reassigning partitions all the time, I hope (or else we have bigger
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>> problems...)
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>> I think the controller will group StopReplicaRequests per broker and
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>> send only one StopReplicaRequest to a broker during controlled
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>> shutdown. Anyway, we don't have to worry about this if we agree that
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>> the other requests will increase by 10X. One MetadataRequest is sent
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>> to each broker in the cluster every time there is a leadership
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>> change. I am not sure this is a real problem. But in theory this
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>> makes the overhead complexity O(number of brokers) and may be a
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>> concern in the future. Ideally we should avoid it.
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>>
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> 3) Less efficient use of physical resources on the machine. The
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> number of sockets on each machine will increase by 10X. The number
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> of connections between any two machines will increase by 100X.
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>>
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> 4) Less efficient way to manage memory and quota.
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>>
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> 5) Rebalance between disks/brokers on the same machine will be less
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> efficient and less flexible. A broker has to read data from another
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> broker on the same machine via a socket. It is also harder to do
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> automatic load balancing between disks on the same machine in the
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> future.
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>>
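The 10X/100X socket numbers in point 3 follow directly from broker counts; a minimal sketch, under the illustrative assumption that every broker keeps a connection to every broker on the other machine:

```python
brokers_before = 1    # one 10-disk broker per machine
brokers_after = 10    # ten single-disk brokers per machine

# Sockets per machine scale with the broker count: 10X.
print(brokers_after // brokers_before)  # 10

# Connections between any two machines scale with the product of the
# broker counts on each side: 10 * 10 = 100X.
print((brokers_after ** 2) // (brokers_before ** 2))  # 100
```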
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> I will put this and the explanation in the rejected alternative
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> section. I have a few questions:
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>>
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> - Can you explain why this solution can help avoid the scalability
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> bottleneck? I actually think it will exacerbate the scalability
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> problem due to 2) above.
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>> - Why can we push more RPCs with this solution?
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>> To really answer this question we'd have to take a deep dive into the
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>> locking of the broker and figure out how effectively it can
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>> parallelize truly independent requests. Almost every multithreaded
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>> process is going to have shared state, like shared queues or shared
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>> sockets, that is going to make scaling less than linear when you add
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>> disks or processors. (And clearly, another option is to improve that
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>> scalability, rather than going multi-process!)
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>>
> >> > > > > > > > >>>>> > > > >> > > > > >> >>>>>
>
> Yeah, I also think it is better to improve scalability inside the
> Kafka code if possible. I am not sure we currently have any
> scalability issue inside Kafka that cannot be removed without using
> multi-process.
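Colin's observation that shared queues and sockets make scaling less than linear is essentially Amdahl's law. A quick back-of-the-envelope sketch (the 5% serial fraction is purely an illustrative assumption, not a measured Kafka number):

```python
# Amdahl's law: with serial fraction s, the speedup on n workers is
# 1 / (s + (1 - s) / n). Even a small shared-state fraction caps the
# gain from adding disks or request-handler threads.
def speedup(serial_fraction: float, n_workers: int) -> float:
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_workers)

# Illustrative: 5% of the work serialized behind a shared lock/queue.
for n in (1, 2, 10, 100):
    print(n, round(speedup(0.05, n), 2))
```

With 5% of work serialized, 100 workers yield well under a 20x speedup, which is why removing the shared bottleneck (as Dong prefers) can beat simply adding processes.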
>>
>>> - It is true that a garbage collection in one broker would not
>>> affect the others. But that is after every broker only uses 1/10 of
>>> the memory. Can we be sure that this will actually help performance?
>>
>> The big question is, how much memory do Kafka brokers use now, and
>> how much will they use in the future? Our experience in HDFS was that
>> once you start getting more than 100-200GB Java heap sizes, full GCs
>> start taking minutes to finish when using the standard JVMs. That
>> alone is a good reason to go multi-process or consider storing more
>> things off the Java heap.
>
> I see. Now I agree one-broker-per-disk should be more efficient in
> terms of GC since each broker probably needs less than 1/10 of the
> memory available on a typical machine nowadays. I will remove this
> from the reasons for rejection.
>>
>> Disk failure is the "easy" case. The "hard" case, which is
>> unfortunately also the much more common case, is disk misbehavior.
>> Towards the end of their lives, disks tend to start slowing down
>> unpredictably. Requests that would have completed immediately before
>> start taking 20, 100, 500 milliseconds. Some files may be readable
>> and other files may not be. System calls hang, sometimes forever, and
>> the Java process can't abort them, because the hang is in the kernel.
>> It is not fun when threads are stuck in "D state"
>> http://stackoverflow.com/questions/20423521/process-perminantly-stuck-on-d-state
>> . Even kill -9 cannot abort the thread then. Fortunately, this is
>> rare.
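For context on the "D state" Colin mentions: on Linux, a task stuck in uninterruptible sleep reports state `D` in `/proc/<pid>/stat`. A minimal sketch of how an operator script might read that field (the helper name is hypothetical, not part of Kafka):

```python
import os

def proc_state(pid: int) -> str:
    """Return the one-letter scheduler state (R, S, D, Z, ...) for pid,
    read from /proc/<pid>/stat on Linux."""
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # The comm field is parenthesized and may contain spaces, so split
    # on the last closing paren; the state letter follows immediately.
    return stat.rsplit(")", 1)[1].split()[0]

# A process reading its own stat is running, so this prints "R";
# a thread hung on dying-disk I/O would show "D" instead.
print(proc_state(os.getpid()))
```

A `D`-state thread holds whatever locks it owns forever, which is why, as Colin notes, even `kill -9` cannot reclaim it.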
>
> I agree it is a harder problem and it is rare. We probably don't have
> to worry about it in this KIP since this issue is orthogonal to
> whether or not we use JBOD.
>>
>> Another approach we should consider is for Kafka to implement its own
>> storage layer that would stripe across multiple disks. This wouldn't
>> have to be done at the block level, but could be done at the file
>> level. We could use consistent hashing to determine which disks a
>> file should end up on, for example.
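The file-level consistent hashing Colin floats could look roughly like the following sketch (not an actual Kafka design; disk paths and the points-per-disk count are illustrative): each disk gets several points on a hash ring, and a file lands on the first disk point at or after the file name's own hash.

```python
import hashlib
from bisect import bisect

def build_ring(disks, points_per_disk=64):
    """Place points_per_disk virtual nodes per disk on a hash ring."""
    ring = []
    for disk in disks:
        for i in range(points_per_disk):
            h = int(hashlib.md5(f"{disk}#{i}".encode()).hexdigest(), 16)
            ring.append((h, disk))
    ring.sort()
    return ring

def disk_for(ring, filename):
    """Map a file to the first ring point at or after its hash."""
    h = int(hashlib.md5(filename.encode()).hexdigest(), 16)
    keys = [k for k, _ in ring]
    idx = bisect(keys, h) % len(ring)  # wrap around the ring
    return ring[idx][1]

ring = build_ring(["/mnt/disk0", "/mnt/disk1", "/mnt/disk2"])
print(disk_for(ring, "topicA-0/00000000000000000000.log"))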
>
> Are you suggesting that we should distribute the log, or log segments,
> across the disks of brokers? I am not sure I fully understand this
> approach. My gut feeling is that this would be a drastic solution that
> would require non-trivial design. While this may be useful to Kafka, I
> would prefer not to discuss this in detail in this thread unless you
> believe it is strictly superior to the design in this KIP in terms of
> solving our use-case.
>>
>> best,
>> Colin
>>
>>> Thanks,
>>> Dong
>>>
>>> On Wed, Jan 25, 2017 at 11:34 AM, Colin McCabe <cmccabe@apache.org>
>>> wrote:
>>>> Hi Dong,
>>>>
>>>> Thanks for the writeup! It's very interesting.
>>>>
>>>> I apologize in advance if this has been discussed somewhere else,
>>>> but I am curious whether you have considered the solution of running
>>>> multiple brokers per node. Clearly there is a memory overhead with
>>>> this solution because of the fixed cost of starting multiple JVMs.
>>>> However, running multiple JVMs would help avoid scalability
>>>> bottlenecks. You could probably push more RPCs per second, for
>>>> example. A garbage collection in one broker would not affect the
>>>> others. It would be interesting to see this considered in the
>>>> "alternate designs" section, even if you end up deciding it's not
>>>> the way to go.
>>>>
>>>> best,
>>>> Colin
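For reference, the multiple-brokers-per-node setup Colin describes needs only a distinct id, listener port, and log directory per JVM. A hedged sketch of two `server.properties` fragments (ports and paths are illustrative, not from the KIP):

```properties
# broker-0.properties (illustrative values)
broker.id=0
listeners=PLAINTEXT://:9092
log.dirs=/mnt/disk0/kafka-logs

# broker-1.properties (illustrative values)
broker.id=1
listeners=PLAINTEXT://:9093
log.dirs=/mnt/disk1/kafka-logs
```

Each JVM then owns its own heap, so a full GC in one broker cannot pause the other, at the cost Colin notes of duplicated fixed JVM overhead.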
>>>>
>>>> On Thu, Jan 12, 2017, at 10:46, Dong Lin wrote:
>>>>> Hi all,
>>>>>
>>>>> We created KIP-112: Handle disk failure for JBOD. Please find the
>>>>> KIP wiki at
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD.
>>>>>
>>>>> This KIP is related to KIP-113
>>>>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-113%3A+Support+replicas+movement+between+log+directories>:
>>>>> Support replicas movement between log directories. They are needed
>>>>> in order to support JBOD in Kafka. Please help review the KIP. Your
>>>>> feedback is appreciated!
>>>>>
>>>>> Thanks,
>>>>> Dong
