kudu-user mailing list archives

From: Darren Hoo <darren....@gmail.com>
Subject: Re: where is kudu's dump core located?
Date: Wed, 06 Apr 2016 00:46:57 GMT
On Wed, Apr 6, 2016 at 8:23 AM, Todd Lipcon <todd@cloudera.com> wrote:

> On Tue, Apr 5, 2016 at 5:20 PM, Darren Hoo <darren.hoo@gmail.com> wrote:
>
>> kudu constantly crashes after running for several hours.
>>
>>
> Do you have anything in your logs? /var/log/kudu/kudu-tserver.WARNING for
> example? or .INFO? The last few lines should be informative as to what is
> causing it to crash.
>

>
>> So I've enabled dumping core in Cloudera Manager, but where is it located?
>>
>
> It should be located in the same directory as the logs, by default. I
> think it can be configured, but I'd check there first.
>

Thanks! I found it. This is the stack trace:
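(For reference, a rough sketch of how I pulled it; the exact core file name
under the log directory differs for each crash:

  gdb /opt/cloudera/parcels/KUDU-0.7.1-1.kudu0.7.1.p0.36/lib/kudu/sbin/kudu-tserver /var/log/kudu/core.<pid>

then `bt` at the gdb prompt.)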

Core was generated by `/opt/cloudera/parcels/KUDU-0.7.1-1.kudu0.7.1.p0.36/lib/kudu/sbin/kudu-tserver -'.

Program terminated with signal 11, Segmentation fault.

#0  0x00000000007ca9c8 in tcmalloc::PageHeap::SearchFreeAndLargeLists(unsigned long) ()

Missing separate debuginfos, use: debuginfo-install cyrus-sasl-gssapi-2.1.23-15.el6_6.2.x86_64 cyrus-sasl-lib-2.1.23-15.el6_6.2.x86_64 cyrus-sasl-plain-2.1.23-15.el6_6.2.x86_64 db4-4.7.25-20.el6_7.x86_64 glibc-2.12-1.166.el6_7.3.x86_64 keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-42.el6.x86_64 libcom_err-1.41.12-22.el6.x86_64 libgcc-4.4.7-16.el6.x86_64 libselinux-2.0.94-5.8.el6.x86_64 libstdc++-4.4.7-16.el6.x86_64 ncurses-libs-5.7-4.20090207.el6.x86_64 nss-softokn-freebl-3.14.3-23.el6_7.x86_64 zlib-1.2.3-29.el6.x86_64

(gdb) bt
#0  0x00000000007ca9c8 in tcmalloc::PageHeap::SearchFreeAndLargeLists(unsigned long) ()
#1  0x00000000007cb132 in tcmalloc::PageHeap::New(unsigned long) ()
#2  0x00000000007c9a5a in tcmalloc::CentralFreeList::Populate() ()
#3  0x00000000007c9c68 in tcmalloc::CentralFreeList::FetchFromOneSpansSafe(int, void**, void**) ()
#4  0x00000000007c9d05 in tcmalloc::CentralFreeList::RemoveRange(void**, void**, int) ()
#5  0x00000000007cc7b3 in tcmalloc::ThreadCache::FetchFromCentralCache(unsigned long, unsigned long) ()
#6  0x00000000018d6940 in tc_newarray ()
#7  0x000000000175f47d in kudu::faststring::GrowByAtLeast(unsigned long) ()
#8  0x000000000173af47 in kudu::PutVarint64(kudu::faststring*, unsigned long) ()
#9  0x0000000001695235 in kudu::cfile::IndexBlockBuilder::Add(kudu::Slice const&, kudu::cfile::BlockPointer const&) ()
#10 0x0000000001668f6d in kudu::cfile::IndexTreeBuilder::Append(kudu::Slice const&, kudu::cfile::BlockPointer const&, unsigned long) ()
#11 0x00000000016692b9 in kudu::cfile::IndexTreeBuilder::FinishBlockAndPropagate(unsigned long) ()
#12 0x0000000001668faf in kudu::cfile::IndexTreeBuilder::Append(kudu::Slice const&, kudu::cfile::BlockPointer const&, unsigned long) ()
#13 0x00000000016692b9 in kudu::cfile::IndexTreeBuilder::FinishBlockAndPropagate(unsigned long) ()
#14 0x0000000001668faf in kudu::cfile::IndexTreeBuilder::Append(kudu::Slice const&, kudu::cfile::BlockPointer const&, unsigned long) ()
#15 0x00000000016692b9 in kudu::cfile::IndexTreeBuilder::FinishBlockAndPropagate(unsigned long) ()


Here is part of the log file kudu-tserver.INFO:


Log file created at: 2016/04/06 07:59:05

Running on machine: server_30

Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg

W0406 07:59:05.955095 107415 log_util.cc:166] Could not read footer for segment: /data-1/kudu/wals/20bb1c47d81342178e6015288c694a35.recovery/wal-000000001: Not found: Footer not found. Footer magic doesn't match

W0406 07:59:05.955308 107415 log_reader.cc:152] Log segment /data-1/kudu/wals/20bb1c47d81342178e6015288c694a35.recovery/wal-000000001 was likely left in-progress after a previous crash. Will try to rebuild footer by scanning data.

W0406 07:59:05.955109 107416 log_util.cc:166] Could not read footer for segment: /data-1/kudu/wals/b55629a2199a418e882163f4f8f6571d.recovery/wal-000000001: Not found: Footer not found. Footer magic doesn't match

W0406 07:59:05.955386 107416 log_reader.cc:152] Log segment /data-1/kudu/wals/b55629a2199a418e882163f4f8f6571d.recovery/wal-000000001 was likely left in-progress after a previous crash. Will try to rebuild footer by scanning data.

W0406 07:59:06.002511 107415 log_util.cc:166] Could not read footer for segment: /data-1/kudu/wals/8fdcc4669df14ada93b3a22fb5c7d193.recovery/wal-000000001: Not found: Footer not found. Footer magic doesn't match

W0406 07:59:06.002519 107415 log_reader.cc:152] Log segment /data-1/kudu/wals/8fdcc4669df14ada93b3a22fb5c7d193.recovery/wal-000000001 was likely left in-progress after a previous crash. Will try to rebuild footer by scanning data.

W0406 07:59:06.091740 107416 log_util.cc:166] Could not read footer for segment: /data-1/kudu/wals/69d50bd42616462b86a7aab4fa9369fb.recovery/wal-000000002: Not found: Footer not found. Footer magic doesn't match

W0406 07:59:06.091750 107416 log_reader.cc:152] Log segment /data-1/kudu/wals/69d50bd42616462b86a7aab4fa9369fb.recovery/wal-000000002 was likely left in-progress after a previous crash. Will try to rebuild footer by scanning data.

W0406 07:59:07.521009 107412 leader_election.cc:274] T b55629a2199a418e882163f4f8f6571d P c7efd632aa6f486c883a5409eb9e3537 [CANDIDATE]: Term 2358 election: RPC error from VoteRequest() call to peer e598047de9cd4155b61770ca2ec40081: Network error: Client connection negotiation failed: client connection to 192.168.20.32:7050: connect: Connection refused (error 111)

W0406 07:59:07.521826 107412 leader_election.cc:280] T b55629a2199a418e882163f4f8f6571d P c7efd632aa6f486c883a5409eb9e3537 [CANDIDATE]: Term 2358 election: Tablet error from VoteRequest() call to peer fbaeb30b25d847a99461528910d82532: Illegal state: Tablet not RUNNING: NOT_STARTED

W0406 07:59:07.634618 107412 leader_election.cc:274] T 8fdcc4669df14ada93b3a22fb5c7d193 P c7efd632aa6f486c883a5409eb9e3537 [CANDIDATE]: Term 2407 election: RPC error from VoteRequest() call to peer 44c8d8a1114046a39044c33597e8559b: Network error: Client connection negotiation failed: client connection to 192.168.20.31:7050: connect: Connection refused (error 111)

W0406 07:59:07.635660 107412 leader_election.cc:280] T 8fdcc4669df14ada93b3a22fb5c7d193 P c7efd632aa6f486c883a5409eb9e3537 [CANDIDATE]: Term 2407 election: Tablet error from VoteRequest() call to peer 9a057060c4c547d0a04d34f7501ebcaa: Illegal state: Tablet not RUNNING: NOT_STARTED

W0406 07:59:10.784858 107412 leader_election.cc:274] T b55629a2199a418e882163f4f8f6571d P c7efd632aa6f486c883a5409eb9e3537 [CANDIDATE]: Term 2359 election: RPC error from VoteRequest() call to peer e598047de9cd4155b61770ca2ec40081: Network error: Client connection negotiation failed: client connection to 192.168.20.32:7050: connect: Connection refused (error 111)

W0406 07:59:10.787231 107558 raft_consensus_state.cc:524] T b55629a2199a418e882163f4f8f6571d P c7efd632aa6f486c883a5409eb9e3537 [term 2359 LEADER]: Can't advance the committed index across term boundaries until operations from the current term are replicated. Last committed operation was: term: 2357 index: 7, New majority replicated is: term: 2357 index: 7, Current term is: 2359

W0406 07:59:11.287231 107412 consensus_peers.cc:319] T b55629a2199a418e882163f4f8f6571d P c7efd632aa6f486c883a5409eb9e3537 -> Peer e598047de9cd4155b61770ca2ec40081 (slave22:7050): Couldn't send request to peer e598047de9cd4155b61770ca2ec40081 for tablet b55629a2199a418e882163f4f8f6571d Status: Network error: Client connection negotiation failed: client connection to 192.168.20.32:7050: connect: Connection refused (error 111). Retrying in the next heartbeat period. Already tried 1 times.

W0406 07:59:11.379604 107412 leader_election.cc:274] T 69d50bd42616462b86a7aab4fa9369fb P c7efd632aa6f486c883a5409eb9e3537 [CANDIDATE]: Term 68 election: RPC error from VoteRequest() call to peer e598047de9cd4155b61770ca2ec40081: Network error: Client connection negotiation failed: client connection to 192.168.20.32:7050: connect: Connection refused (error 111)

W0406 07:59:11.380535 107412 leader_election.cc:280] T 69d50bd42616462b86a7aab4fa9369fb P c7efd632aa6f486c883a5409eb9e3537 [CANDIDATE]: Term 68 election: Tablet error from VoteRequest() call to peer 44c8d8a1114046a39044c33597e8559b: Illegal state: Tablet not RUNNING: BOOTSTRAPPING

W0406 07:59:11.415874 107412 leader_election.cc:280] T 8fdcc4669df14ada93b3a22fb5c7d193 P c7efd632aa6f486c883a5409eb9e3537 [CANDIDATE]: Term 2408 election: Tablet error from VoteRequest() call to peer 9a057060c4c547d0a04d34f7501ebcaa: Illegal state: Tablet not RUNNING: NOT_STARTED

W0406 07:59:11.656478 107412 consensus_peers.cc:319] T b55629a2199a418e882163f4f8f6571d P c7efd632aa6f486c883a5409eb9e3537 -> Peer e598047de9cd4155b61770ca2ec40081 (slave22:7050): Couldn't send request to peer e598047de9cd4155b61770ca2ec40081 for tablet b55629a2199a418e882163f4f8f6571d Status: Network error: Client connection negotiation failed: client connection to 192.168.20.32:7050: connect: Connection refused (error 111). Retrying in the next heartbeat period. Already tried 2 times.

W0406 07:59:11.882887 107412 consensus_peers.cc:319] T 69d50bd42616462b86a7aab4fa9369fb P c7efd632aa6f486c883a5409eb9e3537 -> Peer e598047de9cd4155b61770ca2ec40081 (slave22:7050): Couldn't send request to peer e598047de9cd4155b61770ca2ec40081 for tablet 69d50bd42616462b86a7aab4fa9369fb Status: Network error: Client connection negotiation failed: client connection to 192.168.20.32:7050: connect: Connection refused (error 111). Retrying in the next heartbeat period. Already tried 1 times.

W0406 07:59:11.883358 107412 consensus_peers.cc:319] T 69d50bd42616462b86a7aab4fa9369fb P c7efd632aa6f486c883a5409eb9e3537 -> Peer 44c8d8a1114046a39044c33597e8559b (slave21:7050): Couldn't send request to peer 44c8d8a1114046a39044c33597e8559b for tablet 69d50bd42616462b86a7aab4fa9369fb Status: Illegal state: Tablet not RUNNING: BOOTSTRAPPING. Retrying in the next heartbeat period. Already tried 1 times.

W0406 07:59:11.883757 107578 raft_consensus_state.cc:524] T 69d50bd42616462b86a7aab4fa9369fb P c7efd632aa6f486c883a5409eb9e3537 [term 68 LEADER]: Can't advance the committed index across term boundaries until operations from the current term are replicated. Last committed operation was: term: 66 index: 31763, New majority replicated is: term: 66 index: 31763, Current term is: 68


Does this look like a bug?





>
>>
>> Every time some of the tablet servers crash, part of the data is lost,
>> which is scary.
>>
>
> That's very surprising. A crash of a server should never lose data. How
> are you determining that data is lost? We do extensive testing with
> crashing servers and have not seen data loss in probably a year or so.
>

I just run a query like `select count(1) from my_table`. Before the crash it
returned about 10,000,000; now it returns only about 20,000 (the numbers are
approximate).
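
(For illustration, the check is just a plain count through whatever SQL engine
sits on top of Kudu; with impala-shell, for example, it would be something like:

  impala-shell -q "select count(1) from my_table"

where my_table is just a placeholder name.)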


> -Todd
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
