Mailing-List: contact user-help@kudu.incubator.apache.org; run by ezmlm
Reply-To: user@kudu.incubator.apache.org
MIME-Version: 1.0
Date: Wed, 6 Apr 2016 08:46:57 +0800
Subject: Re: where is kudu's dump core located?
From: Darren Hoo
To: user@kudu.incubator.apache.org
Content-Type: multipart/alternative; boundary=001a1143fa4c8dbb9d052fc64cff

--001a1143fa4c8dbb9d052fc64cff
Content-Type: text/plain; charset=UTF-8

On Wed, Apr 6, 2016 at 8:23 AM, Todd Lipcon wrote:

> On Tue, Apr 5, 2016 at 5:20 PM, Darren Hoo wrote:
>
>> kudu constantly crashes after running several hours.
>>
>
> Do you have anything in your logs? /var/log/kudu/kudu-tserver.WARNING for
> example? or .INFO? The last few lines should be informative as to what is
> causing it to crash.
>
>> So I've enabled dumping core in cloudera manager, but where is it located?
>>
>
> It should be located in the same directory as the logs, by default. I
> think it can be configured, but I'd check there first.
>

Thanks! I found it. This is the stack trace:

Core was generated by `/opt/cloudera/parcels/KUDU-0.7.1-1.kudu0.7.1.p0.36/lib/kudu/sbin/kudu-tserver -'.
Program terminated with signal 11, Segmentation fault.
#0  0x00000000007ca9c8 in tcmalloc::PageHeap::SearchFreeAndLargeLists(unsigned long) ()
Missing separate debuginfos, use: debuginfo-install cyrus-sasl-gssapi-2.1.23-15.el6_6.2.x86_64 cyrus-sasl-lib-2.1.23-15.el6_6.2.x86_64 cyrus-sasl-plain-2.1.23-15.el6_6.2.x86_64 db4-4.7.25-20.el6_7.x86_64 glibc-2.12-1.166.el6_7.3.x86_64 keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-42.el6.x86_64 libcom_err-1.41.12-22.el6.x86_64 libgcc-4.4.7-16.el6.x86_64 libselinux-2.0.94-5.8.el6.x86_64 libstdc++-4.4.7-16.el6.x86_64 ncurses-libs-5.7-4.20090207.el6.x86_64 nss-softokn-freebl-3.14.3-23.el6_7.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0  0x00000000007ca9c8 in tcmalloc::PageHeap::SearchFreeAndLargeLists(unsigned long) ()
#1  0x00000000007cb132 in tcmalloc::PageHeap::New(unsigned long) ()
#2  0x00000000007c9a5a in tcmalloc::CentralFreeList::Populate() ()
#3  0x00000000007c9c68 in tcmalloc::CentralFreeList::FetchFromOneSpansSafe(int, void**, void**) ()
#4  0x00000000007c9d05 in tcmalloc::CentralFreeList::RemoveRange(void**, void**, int) ()
#5  0x00000000007cc7b3 in tcmalloc::ThreadCache::FetchFromCentralCache(unsigned long, unsigned long) ()
#6  0x00000000018d6940 in tc_newarray ()
#7  0x000000000175f47d in kudu::faststring::GrowByAtLeast(unsigned long) ()
#8  0x000000000173af47 in kudu::PutVarint64(kudu::faststring*, unsigned long) ()
#9  0x0000000001695235 in kudu::cfile::IndexBlockBuilder::Add(kudu::Slice const&, kudu::cfile::BlockPointer const&) ()
#10 0x0000000001668f6d in kudu::cfile::IndexTreeBuilder::Append(kudu::Slice const&, kudu::cfile::BlockPointer const&, unsigned long) ()
#11 0x00000000016692b9 in kudu::cfile::IndexTreeBuilder::FinishBlockAndPropagate(unsigned long) ()
#12 0x0000000001668faf in kudu::cfile::IndexTreeBuilder::Append(kudu::Slice const&, kudu::cfile::BlockPointer const&, unsigned long) ()
#13 0x00000000016692b9 in kudu::cfile::IndexTreeBuilder::FinishBlockAndPropagate(unsigned long) ()
#14 0x0000000001668faf in kudu::cfile::IndexTreeBuilder::Append(kudu::Slice const&, kudu::cfile::BlockPointer const&, unsigned long) ()
#15 0x00000000016692b9 in kudu::cfile::IndexTreeBuilder::FinishBlockAndPropagate(unsigned long) ()

Part of the log file kudu-tserver.INFO:

Log file created at: 2016/04/06 07:59:05
Running on machine: server_30
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
W0406 07:59:05.955095 107415 log_util.cc:166] Could not read footer for segment: /data-1/kudu/wals/20bb1c47d81342178e6015288c694a35.recovery/wal-000000001: Not found: Footer not found. Footer magic doesn't match
W0406 07:59:05.955308 107415 log_reader.cc:152] Log segment /data-1/kudu/wals/20bb1c47d81342178e6015288c694a35.recovery/wal-000000001 was likely left in-progress after a previous crash. Will try to rebuild footer by scanning data.
W0406 07:59:05.955109 107416 log_util.cc:166] Could not read footer for segment: /data-1/kudu/wals/b55629a2199a418e882163f4f8f6571d.recovery/wal-000000001: Not found: Footer not found. Footer magic doesn't match
W0406 07:59:05.955386 107416 log_reader.cc:152] Log segment /data-1/kudu/wals/b55629a2199a418e882163f4f8f6571d.recovery/wal-000000001 was likely left in-progress after a previous crash. Will try to rebuild footer by scanning data.
W0406 07:59:06.002511 107415 log_util.cc:166] Could not read footer for segment: /data-1/kudu/wals/8fdcc4669df14ada93b3a22fb5c7d193.recovery/wal-000000001: Not found: Footer not found. Footer magic doesn't match
W0406 07:59:06.002519 107415 log_reader.cc:152] Log segment /data-1/kudu/wals/8fdcc4669df14ada93b3a22fb5c7d193.recovery/wal-000000001 was likely left in-progress after a previous crash. Will try to rebuild footer by scanning data.
W0406 07:59:06.091740 107416 log_util.cc:166] Could not read footer for segment: /data-1/kudu/wals/69d50bd42616462b86a7aab4fa9369fb.recovery/wal-000000002: Not found: Footer not found. Footer magic doesn't match
W0406 07:59:06.091750 107416 log_reader.cc:152] Log segment /data-1/kudu/wals/69d50bd42616462b86a7aab4fa9369fb.recovery/wal-000000002 was likely left in-progress after a previous crash. Will try to rebuild footer by scanning data.
W0406 07:59:07.521009 107412 leader_election.cc:274] T b55629a2199a418e882163f4f8f6571d P c7efd632aa6f486c883a5409eb9e3537 [CANDIDATE]: Term 2358 election: RPC error from VoteRequest() call to peer e598047de9cd4155b61770ca2ec40081: Network error: Client connection negotiation failed: client connection to 192.168.20.32:7050: connect: Connection refused (error 111)
W0406 07:59:07.521826 107412 leader_election.cc:280] T b55629a2199a418e882163f4f8f6571d P c7efd632aa6f486c883a5409eb9e3537 [CANDIDATE]: Term 2358 election: Tablet error from VoteRequest() call to peer fbaeb30b25d847a99461528910d82532: Illegal state: Tablet not RUNNING: NOT_STARTED
W0406 07:59:07.634618 107412 leader_election.cc:274] T 8fdcc4669df14ada93b3a22fb5c7d193 P c7efd632aa6f486c883a5409eb9e3537 [CANDIDATE]: Term 2407 election: RPC error from VoteRequest() call to peer 44c8d8a1114046a39044c33597e8559b: Network error: Client connection negotiation failed: client connection to 192.168.20.31:7050: connect: Connection refused (error 111)
W0406 07:59:07.635660 107412 leader_election.cc:280] T 8fdcc4669df14ada93b3a22fb5c7d193 P c7efd632aa6f486c883a5409eb9e3537 [CANDIDATE]: Term 2407 election: Tablet error from VoteRequest() call to peer 9a057060c4c547d0a04d34f7501ebcaa: Illegal state: Tablet not RUNNING: NOT_STARTED
W0406 07:59:10.784858 107412 leader_election.cc:274] T b55629a2199a418e882163f4f8f6571d P c7efd632aa6f486c883a5409eb9e3537 [CANDIDATE]: Term 2359 election: RPC error from VoteRequest() call to peer e598047de9cd4155b61770ca2ec40081: Network error: Client connection negotiation failed: client connection to 192.168.20.32:7050: connect: Connection refused (error 111)
W0406 07:59:10.787231 107558 raft_consensus_state.cc:524] T b55629a2199a418e882163f4f8f6571d P c7efd632aa6f486c883a5409eb9e3537 [term 2359 LEADER]: Can't advance the committed index across term boundaries until operations from the current term are replicated. Last committed operation was: term: 2357 index: 7, New majority replicated is: term: 2357 index: 7, Current term is: 2359
W0406 07:59:11.287231 107412 consensus_peers.cc:319] T b55629a2199a418e882163f4f8f6571d P c7efd632aa6f486c883a5409eb9e3537 -> Peer e598047de9cd4155b61770ca2ec40081 (slave22:7050): Couldn't send request to peer e598047de9cd4155b61770ca2ec40081 for tablet b55629a2199a418e882163f4f8f6571d Status: Network error: Client connection negotiation failed: client connection to 192.168.20.32:7050: connect: Connection refused (error 111). Retrying in the next heartbeat period. Already tried 1 times.
W0406 07:59:11.379604 107412 leader_election.cc:274] T 69d50bd42616462b86a7aab4fa9369fb P c7efd632aa6f486c883a5409eb9e3537 [CANDIDATE]: Term 68 election: RPC error from VoteRequest() call to peer e598047de9cd4155b61770ca2ec40081: Network error: Client connection negotiation failed: client connection to 192.168.20.32:7050: connect: Connection refused (error 111)
W0406 07:59:11.380535 107412 leader_election.cc:280] T 69d50bd42616462b86a7aab4fa9369fb P c7efd632aa6f486c883a5409eb9e3537 [CANDIDATE]: Term 68 election: Tablet error from VoteRequest() call to peer 44c8d8a1114046a39044c33597e8559b: Illegal state: Tablet not RUNNING: BOOTSTRAPPING
W0406 07:59:11.415874 107412 leader_election.cc:280] T 8fdcc4669df14ada93b3a22fb5c7d193 P c7efd632aa6f486c883a5409eb9e3537 [CANDIDATE]: Term 2408 election: Tablet error from VoteRequest() call to peer 9a057060c4c547d0a04d34f7501ebcaa: Illegal state: Tablet not RUNNING: NOT_STARTED
W0406 07:59:11.656478 107412 consensus_peers.cc:319] T b55629a2199a418e882163f4f8f6571d P c7efd632aa6f486c883a5409eb9e3537 -> Peer e598047de9cd4155b61770ca2ec40081 (slave22:7050): Couldn't send request to peer e598047de9cd4155b61770ca2ec40081 for tablet b55629a2199a418e882163f4f8f6571d Status: Network error: Client connection negotiation failed: client connection to 192.168.20.32:7050: connect: Connection refused (error 111). Retrying in the next heartbeat period. Already tried 2 times.
W0406 07:59:11.882887 107412 consensus_peers.cc:319] T 69d50bd42616462b86a7aab4fa9369fb P c7efd632aa6f486c883a5409eb9e3537 -> Peer e598047de9cd4155b61770ca2ec40081 (slave22:7050): Couldn't send request to peer e598047de9cd4155b61770ca2ec40081 for tablet 69d50bd42616462b86a7aab4fa9369fb Status: Network error: Client connection negotiation failed: client connection to 192.168.20.32:7050: connect: Connection refused (error 111). Retrying in the next heartbeat period. Already tried 1 times.
W0406 07:59:11.883358 107412 consensus_peers.cc:319] T 69d50bd42616462b86a7aab4fa9369fb P c7efd632aa6f486c883a5409eb9e3537 -> Peer 44c8d8a1114046a39044c33597e8559b (slave21:7050): Couldn't send request to peer 44c8d8a1114046a39044c33597e8559b for tablet 69d50bd42616462b86a7aab4fa9369fb Status: Illegal state: Tablet not RUNNING: BOOTSTRAPPING. Retrying in the next heartbeat period. Already tried 1 times.
W0406 07:59:11.883757 107578 raft_consensus_state.cc:524] T 69d50bd42616462b86a7aab4fa9369fb P c7efd632aa6f486c883a5409eb9e3537 [term 68 LEADER]: Can't advance the committed index across term boundaries until operations from the current term are replicated. Last committed operation was: term: 66 index: 31763, New majority replicated is: term: 66 index: 31763, Current term is: 68

Does it look like a bug?

>
>>
>> every time some of the tablet servers crash, part of the data is lost,
>> this is so scary.
>>
>
> That's very surprising. A crash of a server should never lose data. How
> are you determining that data is lost? We do extensive testing with
> crashing servers and have not seen data loss in probably a year or so.
>

I just run a query like `select count(1) from my_table`. Before the crash it
returned about 10,000,000; now it returns only 20,000. Is that count not exact?
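In case it helps when skimming a longer log, here is a minimal Python sketch that parses lines using the prefix stated in the log header above (`[IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg`) and counts them per severity and source file. The names `parse_line` and `count_by_source` are hypothetical helpers written for this illustration, not part of Kudu or glog, and the regular expression assumes every record fits on one line.

```python
import re
from collections import Counter

# Prefix as described in the log header:
# [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
GLOG_LINE = re.compile(
    r"^(?P<level>[IWEF])(?P<date>\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) "
    r"(?P<file>[^:\s]+):(?P<line>\d+)\] "
    r"(?P<msg>.*)$"
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = GLOG_LINE.match(line.rstrip("\n"))
    return match.groupdict() if match else None


def count_by_source(lines):
    """Count parsed lines per (severity, source file); unparsed lines are skipped."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is not None:
            counts[(record["level"], record["file"])] += 1
    return counts
```

Passing an open file handle for kudu-tserver.INFO to `count_by_source` and printing `most_common()` on the result would show which source files emit the most warnings; header lines such as "Log file created at: ..." do not match the pattern and are ignored.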
> -Todd
> --
> Todd Lipcon
> Software Engineer, Cloudera
>

--001a1143fa4c8dbb9d052fc64cff--