kafka-users mailing list archives

From Jun Rao <...@confluent.io>
Subject Re: How to handle broker disk failure
Date Wed, 21 Jan 2015 16:14:39 GMT
Yes, what you can do is exclude the bad disk from log.dirs and then
restart the broker. The missing data will be copied over automatically.
This is likely cheaper than reassigning partitions.
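The suggestion above comes down to a one-line change in the broker config before the restart. Here is a minimal sketch, assuming the broker reads a `server.properties` file with a comma-separated `log.dirs` entry. The directory paths are made up, and `drop_log_dir` is a hypothetical helper, not part of Kafka. It removes the failed directory and refuses to leave `log.dirs` empty.

```python
def drop_log_dir(properties_text: str, bad_dir: str) -> str:
    """Return the properties text with bad_dir removed from log.dirs.

    Hypothetical helper for illustration; only the plural `log.dirs`
    key is handled (the singular `log.dir` fallback is ignored).
    """
    out_lines = []
    found = False
    for line in properties_text.splitlines():
        key, sep, value = line.partition("=")
        # Commented-out lines like "#log.dirs=..." do not match this key.
        if sep and key.strip() == "log.dirs":
            found = True
            dirs = [d.strip() for d in value.split(",") if d.strip()]
            kept = [d for d in dirs if d != bad_dir]
            if not kept:
                raise ValueError("removing this directory would leave log.dirs empty")
            line = "log.dirs=" + ",".join(kept)
        out_lines.append(line)
    if not found:
        raise ValueError("no log.dirs entry found")
    return "\n".join(out_lines) + "\n"


# Example with made-up paths: /data2 is the failed disk.
config = """broker.id=1
log.dirs=/data1/kafka-logs,/data2/kafka-logs,/data3/kafka-logs
"""
print(drop_log_dir(config, "/data2/kafka-logs"))
```

After writing the edited file back, restart the broker. As described above, it should then re-replicate the partitions that lived on the missing disk from the other replicas.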

Thanks,

Jun

On Wed, Jan 21, 2015 at 7:49 AM, Koert Kuipers <koert@tresata.com> wrote:

> Same situation with us. We run JBOD and actually don't replace the failed
> data disks at all. We simply keep boxes running until the number of
> non-failed drives falls below some threshold. So our procedure with Kafka
> would be:
> 1) Ideally, the Kafka server simply survives a failed disk, keeps going,
> and fixes itself with the data disks that are left.
> 2) If the Kafka server does not survive a failed drive, can we start it
> back up with one less data disk and have it fix itself?
>
>
> On Wed, Jan 21, 2015 at 6:11 AM, svante karlsson <saka@csi.se> wrote:
>
> > Is it possible to continue to serve topics from the remaining disks
> > while waiting for a replacement disk, or will the broker exit/stop
> > working? (We would like to be able to replace disks in a relaxed manner,
> > since our datacenter is colocated and we don't have permanent staff
> > there; there is simply not enough to do to justify 24h staffing.)
> >
> > If we trigger a rebalance during the downtime, the under-replicated
> > topics/partitions will hopefully be moved somewhere else? What happens
> > then, when we add the broker again, now with a new empty disk? Will all
> > over-replicated partitions be removed from the reinserted broker, and
> > finally, should/must we trigger a rebalance?
> >
> > /svante
> >
> > 2015-01-21 2:56 GMT+01:00 Jun Rao <jun@confluent.io>:
> >
> > > Actually, you don't need to reassign partitions in this case. You just
> > > need to replace the bad disk and restart the broker. It will copy the
> > > missing data over automatically.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Tue, Jan 20, 2015 at 1:02 AM, svante karlsson <saka@csi.se> wrote:
> > >
> > > > I'm trying to figure out the best way to handle a disk failure in a
> > > > live environment.
> > > >
> > > > The obvious (and naive) solution is to decommission the broker and
> > > > let other brokers take over and create new followers. Then replace
> > > > the disk, clean the remaining log directories, and add the broker
> > > > again.
> > > >
> > > > The disadvantage of this approach is, of course, the network overhead
> > > > and the time it takes to reassign partitions.
> > > >
> > > > Is there a better way?
> > > >
> > > > As a sub question, is it possible to continue running a broker with a
> > > > failed drive and still serve the remaining partitions?
> > > >
> > > > thanks,
> > > > svante
> > > >
> > >
> >
>
