kudu-user mailing list archives

From William Berkeley <wdberke...@gmail.com>
Subject Re: Segmentation Fault when running kudu ksck
Date Mon, 20 Aug 2018 18:19:11 GMT
That looks like KUDU-2113, which was fixed in 1.6.0.

It happens if the tablet servers report peers in their config that are not
known to the master. Probably, you have removed servers from the cluster
and some of the tablets are in a bad state as a result. These sorts of
problems were unfortunately common on earlier Kudu releases. Every new
version since 5.12 has made significant improvements to prevent these sorts
of situations. Since the fix landed in 1.6.0 and you are on 1.5.0, I'd
recommend upgrading to 1.6 or later, or at least taking a kudu tool from 1.6
or later and running it against the 1.5 cluster to see what the issues are.
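
To make the failure condition concrete, here is a minimal Python sketch of
the check involved. It is not Kudu's code: the function name is hypothetical,
the UUIDs are copied from the quoted ksck output, and treating peer B as the
server unknown to the master is an assumption made only for illustration.

```python
# Illustrative only: plain Python data, not a Kudu API.
def find_unknown_peers(master_known_ts, replica_configs):
    """Map each reporting replica to the peers in its config that the master does not know."""
    unknown = {}
    for reporter, peers in replica_configs.items():
        missing = sorted(set(peers) - set(master_known_ts))
        if missing:
            unknown[reporter] = missing
    return unknown


# Peer UUIDs as labelled A, B, C in the quoted ksck output.
A = "1a05af887edf4ba7b5c1731ce3508b19"
B = "1b3d49dd6ce64acda32f97a89d7de193"
C = "4028533287964369928034c3616a0a16"

# Assumption: B was removed from the cluster, so the master only knows A and C.
master_known_ts = {A, C}

# Assumption: both running replicas still list all three peers in their config.
replica_configs = {A: [A, B, C], C: [A, B, C]}

unknown = find_unknown_peers(master_known_ts, replica_configs)
for reporter, peers in unknown.items():
    print(f"{reporter} reports peer(s) unknown to the master: {', '.join(peers)}")
```

A non-empty result corresponds to the situation described above, where a
replica's config names a peer the master has no record of.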

-Will

On Mon, Aug 20, 2018 at 10:57 AM, Vincent Kooijman <
vincent.kooijman@onmarc.nl> wrote:

> Hi all,
>
>
>
> We're running into a few Kudu issues with the first being the Kudu cluster
> check utility (sudo -u kudu /opt/cloudera/parcels/CDH/lib/kudu/bin-debug/kudu
> cluster ksck) showing:
>
>
>
> Connected to the Master
>
> Fetched info from all 10 Tablet Servers
>
>
>
> Tablet 41bf41e4127a46c69242f707298cf4ba of table 'xxx' is
> under-replicated: 1 replica(s) not RUNNING
>
>   1b3d49dd6ce64acda32f97a89d7de193: TS unavailable
>
>   1a05af887edf4ba7b5c1731ce3508b19 (pdn05:7050): RUNNING [LEADER]
>
>   4028533287964369928034c3616a0a16 (pdn01:7050): RUNNING
>
>
>
> 2 replicas' active configs differ from the master's.
>
>   All the peers reported by the master and tablet servers are:
>
>   A = 1a05af887edf4ba7b5c1731ce3508b19
>
>   B = 1b3d49dd6ce64acda32f97a89d7de193
>
>   C = 4028533287964369928034c3616a0a16
>
>
>
> *The consensus matrix is:*
>
> *Segmentation fault*
>
>
>
> There is some mention of segmentation fault in combination with ksck in
> the Kudu release notes for 1.4.0, but we are running 1.5.0 on a CDH cluster.
>
>
>
> Some notes:
>
>
>
>    - All masters (we have 3) are up with one leader being elected
>    - All tablet servers (10) are live and visible in the master web UI
>    - We've run kudu fs check ... -repair on all servers (master & tablet)
>    - Master logs are filled with errors like:
>
>    Previously reported cstate for tablet 5977f01cea44448a908bb56f97b46d9e
>    (table 'xxx' [id=bb359f4b89dd46e797e2e24f9efac971]) gave a different
>    leader for term 2007 than the current cstate. Previous cstate:
>    current_term: 2007 leader_uuid: ""
>
>    - And tablet server logs contain a lot of:
>
>    Couldn't send request to peer 228515616baf44a99561c2b72dfb3bab for
>    tablet 138854a04f804f4ebf42df657c22b995. Error code:
>    TABLET_NOT_RUNNING (12). Status: Illegal state: Tablet not RUNNING:
>    INITIALIZED. Retrying in the next heartbeat period. Already tried 12813
>    times.
>
>
>
> We're a bit lost as to where to look next.
>
>
>
> If anyone can point us in the right direction, that would be great!
>
>
> Thanks,
>
>
>
> Vincent
>
