Mailing-List: contact users-help@kafka.apache.org; run by ezmlm
Precedence: bulk
Reply-To: users@kafka.apache.org
Received-SPF: pass (nike.apache.org: domain of zecmerquise@gmail.com
 designates 209.85.219.41 as permitted sender)
MIME-Version: 1.0
Sender: zecmerquise@gmail.com
In-Reply-To: 
 <CAFbh0Q3ZibTUg2C=q8M7Jjwti94p=4S8yPxbxxc9ExWToZ4uUw@mail.gmail.com>
References: 
 <CAERq+9q-4M-Obw+TVJxZZKoyKX0+tQrp06TGeb0ihpcT5aYfKQ@mail.gmail.com>
	<CAFbh0Q3ZY5XuGWv5p7SfiNPZCTTAp6uGH2fHbmwGnFGAQOrUPA@mail.gmail.com>
	<CAERq+9ru35UnP16Q782yh1nmP+F45WNH2qt+rwf-=py-X6vNxw@mail.gmail.com>
	<CAFbh0Q3ZibTUg2C=q8M7Jjwti94p=4S8yPxbxxc9ExWToZ4uUw@mail.gmail.com>
Date: Tue, 7 Jan 2014 10:17:14 +0100
Message-ID: 
 <CAERq+9qtbhfana=Pe8REwnQ+o-yGb_vyoUy-+TWUiemcmNgqdw@mail.gmail.com>
Subject: Re: Trouble recovering after a crashed broker
From: Vincent Rischmann <vincent@rischmann.fr>
To: users@kafka.apache.org
Content-Type: multipart/alternative; boundary=f46d0444eb2993f2dd04ef5dd7d5

--f46d0444eb2993f2dd04ef5dd7d5
Content-Type: text/plain; charset=UTF-8

Hi,

This is the output of list topic:

topic: clicks partition: 0 leader: 1 replicas: 1 isr: 1
topic: clicks partition: 1 leader: 3 replicas: 3 isr: 3
topic: clicks partition: 2 leader: 1 replicas: 1 isr: 1
topic: visits partition: 0 leader: 3 replicas: 3 isr: 3
topic: visits partition: 1 leader: 2 replicas: 2 isr: 2
topic: visits partition: 2 leader: 3 replicas: 3 isr: 3
topic: stats.live.test partition: 0 leader: 3 replicas: 3,1,2 isr: 3,2,1
topic: stats.live.test partition: 1 leader: 2 replicas: 1,2,3 isr: 2,3,1
topic: stats.live.test partition: 2 leader: 2 replicas: 2,3,1 isr: 2,3,1

The topic causing problems is "clicks", and the partitions requested on the
crashed broker are 0 and 2.
Given the output of list topic, this means that those two partitions are
permanently lost right now, right?
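
To double-check that reading, here is a minimal sketch in Python that parses
the list topic output above and reports which partitions exist only on one
broker. It assumes the crashed broker has id 1 (inferred from the
"[KafkaApi-1]" prefix in its log) and that the output keeps the exact
"topic: ... partition: ... leader: ... replicas: ... isr: ..." layout shown
above. The function names are hypothetical helpers, not part of any Kafka tool.

```python
import re

# Output of list topic, copied verbatim from this message.
LIST_TOPIC_OUTPUT = """\
topic: clicks partition: 0 leader: 1 replicas: 1 isr: 1
topic: clicks partition: 1 leader: 3 replicas: 3 isr: 3
topic: clicks partition: 2 leader: 1 replicas: 1 isr: 1
topic: visits partition: 0 leader: 3 replicas: 3 isr: 3
topic: visits partition: 1 leader: 2 replicas: 2 isr: 2
topic: visits partition: 2 leader: 3 replicas: 3 isr: 3
topic: stats.live.test partition: 0 leader: 3 replicas: 3,1,2 isr: 3,2,1
topic: stats.live.test partition: 1 leader: 2 replicas: 1,2,3 isr: 2,3,1
topic: stats.live.test partition: 2 leader: 2 replicas: 2,3,1 isr: 2,3,1
"""

# Assumption: one partition per line, in the layout shown above.
LINE_RE = re.compile(
    r"topic:\s*(?P<topic>\S+)\s+partition:\s*(?P<partition>\d+)\s+"
    r"leader:\s*(?P<leader>-?\d+)\s+replicas:\s*(?P<replicas>[\d,]*)\s+"
    r"isr:\s*(?P<isr>[\d,]*)"
)


def parse_list_topic(text):
    """Turn each matching line into a dict; lines that do not match are skipped."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m is None:
            continue
        rows.append({
            "topic": m["topic"],
            "partition": int(m["partition"]),
            "leader": int(m["leader"]),
            "replicas": [int(b) for b in m["replicas"].split(",") if b],
            "isr": [int(b) for b in m["isr"].split(",") if b],
        })
    return rows


def only_on_broker(rows, broker_id):
    """Partitions whose single replica lives on broker_id."""
    return [(r["topic"], r["partition"])
            for r in rows if r["replicas"] == [broker_id]]


def unreplicated(rows):
    """Partitions with a replication factor of 1, on any broker."""
    return [(r["topic"], r["partition"])
            for r in rows if len(r["replicas"]) == 1]


if __name__ == "__main__":
    rows = parse_list_topic(LIST_TOPIC_OUTPUT)
    # Assumption: broker 1 is the one that crashed.
    print(only_on_broker(rows, 1))   # [('clicks', 0), ('clicks', 2)]
    print(len(unreplicated(rows)))   # 6
```

With broker 1 as the crashed broker, only "clicks" partitions 0 and 2 have no
other copy. The same check shows that "visits" is not replicated either, so it
would be exposed in the same way if broker 2 or 3 lost its data.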

I thought all partitions were replicated, just like for the topic
'stats.live.test', but apparently I screwed up when creating the topics. I
should have checked that first.

Thanks for your help.


2014/1/6 Jun Rao <junrao@gmail.com>

> How many replicas do you have on that topic? What's the output of list
> topic?
>
> Thanks,
>
> Jun
>
>
> On Mon, Jan 6, 2014 at 1:45 AM, Vincent Rischmann <vincent@rischmann.fr> wrote:
>
> > Hi,
> >
> > yes, I'm seeing the errors on the crashed broker.
> >
> > My controller.log file only contains the following:
> >
> > [2014-01-03 09:41:01,794] INFO [ControllerEpochListener on 1]: Initialized
> > controller epoch to 11 and zk version 10
> > (kafka.controller.ControllerEpochListener)
> > [2014-01-03 09:41:01,812] INFO [Controller 1]: Controller starting up
> > (kafka.controller.KafkaController)
> > [2014-01-03 09:41:02,082] INFO [Controller 1]: Controller startup complete
> > (kafka.controller.KafkaController)
> >
> > Since Friday, nothing has changed and the broker has generated multiple
> > gigabytes of traces in server.log. One of the last exceptions looks like
> > this:
> >
> > Request for offset 787449 but we only have log segments in the range 0 to
> > 163110.
> >
> > The range has increased since Friday (it was "0 to 19372"). Does this mean
> > the broker is actually catching up?
> >
> >
> > Thanks for your help.
> >
> >
> >
> >
> > 2014/1/3 Jun Rao <junrao@gmail.com>
> >
> > > If a broker crashes and restarts, it will catch up the missing data from
> > > the leader replicas. Normally, when this broker is catching up, it won't be
> > > serving any client requests though. Are you seeing those errors on the
> > > crashed broker? Also, you are not supposed to see OffsetOutOfRangeException
> > > with just one broker failure with 3 replicas. Do you see the following in
> > > the controller log?
> > >
> > > "No broker in ISR is alive for ... There's potential data loss."
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Fri, Jan 3, 2014 at 1:23 AM, Vincent Rischmann <zecmerquise@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > We have a cluster of three 0.8 brokers, and this morning one of the brokers
> > > > crashed.
> > > > It is a test broker, and we stored the logs in /tmp/kafka-logs. All topics
> > > > in use are replicated on the three brokers.
> > > >
> > > > You can guess the problem: when the broker rebooted, it wiped all the data
> > > > in the logs.
> > > >
> > > > The producers and consumers are fine, but the broker with the wiped data
> > > > keeps generating a lot of exceptions, and I don't really know what to do to
> > > > recover.
> > > >
> > > > Example exception:
> > > >
> > > > [2014-01-03 10:09:47,755] ERROR [KafkaApi-1] Error when processing fetch
> > > > request for partition [topic,0] offset 814798 from consumer with
> > > > correlation id 0 (kafka.server.KafkaApis)
> > > > kafka.common.OffsetOutOfRangeException: Request for offset 814798 but we
> > > > only have log segments in the range 0 to 19372.
> > > >
> > > > There are a lot of them, something like 10+ per second. I (maybe wrongly)
> > > > assumed that the broker would catch up. If that's the case, how can I see
> > > > the progress?
> > > >
> > > > In general, what is the recommended way to bring back a broker with wiped
> > > > data in a cluster?
> > > >
> > > > Thanks.
> > > >
> > >
> >
>

--f46d0444eb2993f2dd04ef5dd7d5--