Mailing-List: contact user-help@kudu.incubator.apache.org; run by ezmlm
Reply-To: user@kudu.incubator.apache.org
MIME-Version: 1.0
Date: Fri, 19 Feb 2016 17:05:04 -0800
Subject: Re: [KUDU Tablet]unrecoverable crash
From: Nick Wolf
To: user@kudu.incubator.apache.org
Content-Type: multipart/alternative; boundary=001a113ed432977f50052c293097

--001a113ed432977f50052c293097
Content-Type: text/plain;
 charset=UTF-8

I've identified the tablet ID, deleted its files, and restarted the server, but it seems to be a chain reaction: the crashes keep coming, one by one, each with a different tablet ID.

rm tablet-meta/1c2475126c7c4cc2b82f95bd6af5bdb4
rm wals/1c2475126c7c4cc2b82f95bd6af5bdb4
rm consensus-meta/1c2475126c7c4cc2b82f95bd6af5bdb4

A notable point here is that none of the tablet IDs the log shows bootstrapping appear in the web interface (http://host:8051/tables).

On Fri, Feb 19, 2016 at 12:49 PM, Todd Lipcon wrote:
> Hi Nick,
>
> Are you able to determine the tablet ID that is failing to restart?
> The log line indicates that it's thread ID 6285. If you look farther
> up the log with 'grep " 6285 " kudu-tserver.INFO', you should see a
> log message indicating that that thread is starting to bootstrap a
> particular tablet.
>
> Is this a replicated table, or num_replicas=1? If it's replicated, we
> can probably recover by removing the corrupt replica and letting it
> grab a new copy from one of the other replicas. Otherwise, we'll have
> to do some more serious "surgery" which we can assist you with.
>
> Either way, see if you can figure out the bad tablet ID. Then, if it's
> possible to send a copy of the WAL directory for this tablet to me off
> list, I can try to do some post-mortem analysis to see what went
> wrong.
>
> Thanks
> -Todd
>
> On Fri, Feb 19, 2016 at 12:37 PM, Nick Wolf wrote:
> > KUDU Tablet crashed with following fatal error.
> >
> > F0219 12:15:11.389806 6285 mvcc.cc:542] Check failed: _s.ok() Bad status:
> > Illegal state: Timestamp: 5963266013874102274 is already committed. Current
> > Snapshot: MvccSnapshot[committed={T|T < 5963266013874118554 or (T in
> > {5963266013874118554})}]
> >
> > It throws the same fatal error and crashes immediately no matter how many
> > times i try to restart the service.
> >
> > Any ideas to get out of this situation? I don't want to lose the data.
> >
> >
> > --Nick
> >
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>

--001a113ed432977f50052c293097
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
--001a113ed432977f50052c293097--
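
The log search Todd suggests in this thread (grep for thread ID 6285 in kudu-tserver.INFO) can be sketched in Python as below. This is only an illustration, not a Kudu tool. It assumes the log uses the standard glog line prefix seen in the quoted fatal error (severity letter, date, time, thread ID, file:line), and it assumes a tablet ID appears in a log message as a 32-character lowercase hex string like the one in the rm commands. The function names and the sample lines other than the fatal error are made up for the example.

```python
import re

# glog prefix as seen in the thread: "F0219 12:15:11.389806  6285 mvcc.cc:542] msg"
GLOG_RE = re.compile(
    r"^(?P<sev>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<tid>\d+) (?P<src>[^\s\]]+:\d+)\] (?P<msg>.*)$"
)

# Assumption: tablet IDs look like 1c2475126c7c4cc2b82f95bd6af5bdb4 (32 hex chars).
HEX32_RE = re.compile(r"\b[0-9a-f]{32}\b")


def lines_for_thread(lines, thread_id):
    """Return regex matches for glog lines written by the given thread ID."""
    matches = []
    for line in lines:
        m = GLOG_RE.match(line.rstrip("\n"))
        if m and int(m.group("tid")) == thread_id:
            matches.append(m)
    return matches


def candidate_tablet_ids(lines, thread_id):
    """Collect 32-hex strings, in first-seen order, from one thread's messages."""
    seen = []
    for m in lines_for_thread(lines, thread_id):
        for hex_id in HEX32_RE.findall(m.group("msg")):
            if hex_id not in seen:
                seen.append(hex_id)
    return seen


if __name__ == "__main__":
    # The first two lines are hypothetical; the third is the fatal error from the thread.
    sample = [
        "I0219 12:15:10.000000  6285 example.cc:1] made-up message naming "
        "1c2475126c7c4cc2b82f95bd6af5bdb4",
        "I0219 12:15:10.500000  6300 example.cc:2] made-up message from another thread",
        "F0219 12:15:11.389806  6285 mvcc.cc:542] Check failed: _s.ok() Bad status: "
        "Illegal state: Timestamp: 5963266013874102274 is already committed.",
    ]
    print(candidate_tablet_ids(sample, 6285))
```

Compared with a plain grep for " 6285 ", matching on the thread ID field avoids hits where the same digits happen to appear inside a message. To run it on a real log, pass an open file object (for example the kudu-tserver.INFO file) as the lines argument.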