Mailing-List: contact users-help@kafka.apache.org; run by ezmlm
Reply-To: users@kafka.apache.org
MIME-Version: 1.0
Sender: zecmerquise@gmail.com
Date: Tue, 7 Jan 2014 10:17:14 +0100
Subject: Re: Trouble recovering after a crashed broker
From: Vincent Rischmann
To: users@kafka.apache.org
Content-Type: multipart/alternative; boundary=f46d0444eb2993f2dd04ef5dd7d5

--f46d0444eb2993f2dd04ef5dd7d5
Content-Type: text/plain; charset=UTF-8

Hi,

this is the output of list topic:

topic: clicks partition: 0 leader: 1 replicas: 1 isr: 1
topic: clicks partition: 1 leader: 3 replicas: 3 isr: 3
topic: clicks partition: 2 leader: 1 replicas: 1 isr: 1
topic: visits partition: 0 leader: 3 replicas: 3 isr: 3
topic: visits partition: 1 leader: 2 replicas: 2 isr: 2
topic: visits partition: 2 leader: 3 replicas: 3 isr: 3
topic: stats.live.test partition: 0 leader: 3 replicas: 3,1,2 isr: 3,2,1
topic: stats.live.test partition: 1 leader: 2 replicas: 1,2,3 isr: 2,3,1
topic: stats.live.test partition: 2 leader: 2 replicas: 2,3,1 isr: 2,3,1

The topic causing problems is "clicks", and the partitions requested on the
crashed broker are 0 and 2. Given the output of list topic, this means that
those 2 partitions are permanently lost right now, right?

I thought all partitions were replicated, just like for the topic
'stats.live.test', but apparently I screwed up when creating the topics; I
should have checked that first.

Thanks for your help.


2014/1/6 Jun Rao

> How many replicas do you have on that topic? What's the output of list
> topic?
>
> Thanks,
>
> Jun
>
>
> On Mon, Jan 6, 2014 at 1:45 AM, Vincent Rischmann wrote:
>
> > Hi,
> >
> > yes, I'm seeing the errors on the crashed broker.
> >
> > My controller.log file only contains the following:
> >
> > [2014-01-03 09:41:01,794] INFO [ControllerEpochListener on 1]: Initialized
> > controller epoch to 11 and zk version 10
> > (kafka.controller.ControllerEpochListener)
> > [2014-01-03 09:41:01,812] INFO [Controller 1]: Controller starting up
> > (kafka.controller.KafkaController)
> > [2014-01-03 09:41:02,082] INFO [Controller 1]: Controller startup complete
> > (kafka.controller.KafkaController)
> >
> > Since Friday, nothing has changed and the broker generated multiple
> > gigabytes of traces in server.log; one of the last exceptions looks like
> > this:
> >
> > Request for offset 787449 but we only have log segments in the range 0 to
> > 163110.
> >
> > The range has increased since Friday (it was "0 to 19372"), does this mean
> > the broker is actually catching up?
> >
> >
> > Thanks for your help.
> >
> >
> > 2014/1/3 Jun Rao
> >
> > > If a broker crashes and restarts, it will catch up the missing data from
> > > the leader replicas. Normally, when this broker is catching up, it won't be
> > > serving any client requests though. Are you seeing those errors on the
> > > crashed broker? Also, you are not supposed to see OffsetOutOfRangeException
> > > with just one broker failure with 3 replicas. Do you see the following in
> > > the controller log?
> > >
> > > "No broker in ISR is alive for ... There's potential data loss."
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Fri, Jan 3, 2014 at 1:23 AM, Vincent Rischmann <zecmerquise@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > We have a cluster of 3 0.8 brokers, and this morning one of the brokers
> > > > crashed.
> > > > It is a test broker, and we stored the logs in /tmp/kafka-logs. All topics
> > > > in use are replicated on the three brokers.
> > > >
> > > > You can guess the problem, when the broker rebooted it wiped all the data
> > > > in the logs.
> > > >
> > > > The producers and consumers are fine, but the broker with the wiped data
> > > > keeps generating a lot of exceptions, and I don't really know what to do to
> > > > recover.
> > > >
> > > > Example exception:
> > > >
> > > > [2014-01-03 10:09:47,755] ERROR [KafkaApi-1] Error when processing fetch
> > > > request for partition [topic,0] offset 814798 from consumer with
> > > > correlation id 0 (kafka.server.KafkaApis)
> > > > kafka.common.OffsetOutOfRangeException: Request for offset 814798 but we
> > > > only have log segments in the range 0 to 19372.
> > > >
> > > > There are a lot of them, something like 10+ per second. I (maybe wrongly)
> > > > assumed that the broker would catch up, if that's the case how can I see
> > > > the progress?
> > > >
> > > > In general, what is the recommended way to bring back a broker with wiped
> > > > data in a cluster?
> > > >
> > > > Thanks.
> > > >
> > >
> >

--f46d0444eb2993f2dd04ef5dd7d5--
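
For anyone wanting to run the same check on their own cluster, here is a minimal sketch that parses list topic output in the format pasted at the top of this thread and reports the partitions whose only replica lives on a given broker. The function names (parse_list_topic, only_on_broker) are made up for illustration and are not part of Kafka, and broker id 1 is an assumption inferred from the "[KafkaApi-1]" tag in the quoted log.

```python
import re

# Output of list topic, copied from the message above.
LIST_TOPIC_OUTPUT = """\
topic: clicks partition: 0 leader: 1 replicas: 1 isr: 1
topic: clicks partition: 1 leader: 3 replicas: 3 isr: 3
topic: clicks partition: 2 leader: 1 replicas: 1 isr: 1
topic: visits partition: 0 leader: 3 replicas: 3 isr: 3
topic: visits partition: 1 leader: 2 replicas: 2 isr: 2
topic: visits partition: 2 leader: 3 replicas: 3 isr: 3
topic: stats.live.test partition: 0 leader: 3 replicas: 3,1,2 isr: 3,2,1
topic: stats.live.test partition: 1 leader: 2 replicas: 1,2,3 isr: 2,3,1
topic: stats.live.test partition: 2 leader: 2 replicas: 2,3,1 isr: 2,3,1
"""

# One line of list topic output; the layout is assumed from the sample above.
LINE_RE = re.compile(
    r"topic:\s*(?P<topic>\S+)\s+partition:\s*(?P<partition>\d+)\s+"
    r"leader:\s*(?P<leader>-?\d+)\s+replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"isr:\s*(?P<isr>[\d,]*)"
)


def parse_list_topic(output):
    """Turn list topic output into a list of dicts, skipping unrecognised lines."""
    rows = []
    for line in output.splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue
        rows.append({
            "topic": match["topic"],
            "partition": int(match["partition"]),
            "leader": int(match["leader"]),
            "replicas": [int(b) for b in match["replicas"].split(",")],
            "isr": [int(b) for b in match["isr"].split(",") if b],
        })
    return rows


def only_on_broker(rows, broker_id):
    """Return (topic, partition) pairs whose single replica is broker_id."""
    return [
        (row["topic"], row["partition"])
        for row in rows
        if row["replicas"] == [broker_id]
    ]


rows = parse_list_topic(LIST_TOPIC_OUTPUT)
# Broker id 1 is an assumption (taken from the "[KafkaApi-1]" log tag).
at_risk = only_on_broker(rows, 1)
print(at_risk)  # [('clicks', 0), ('clicks', 2)]
```

With the output above this flags clicks partitions 0 and 2, which matches the two partitions mentioned in the reply: they have a single replica, so no other broker holds a copy to catch up from.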