kudu-user mailing list archives

From Jason Heo <jason.heo....@gmail.com>
Subject Re: Question about redistributing tablets on failure of a tserver.
Date Fri, 14 Apr 2017 06:35:27 GMT
@Dan

I monitored with `kudu ksck` while re-replication was occurring, but I'm not
sure whether this output means my cluster has a problem. (It seems to just
indicate that one tserver stopped.)

Would you please check it?

Thanks,

Jason

```
...
...
Tablet 0e29XXXXXXXXXXXXXXX1e1e3168a4d81 of table 'impala::tbl1' is
under-replicated: 1 replica(s) not RUNNING
  a7ca07f9bXXXXXXXXXXXXXXXbbb21cfb (hostname.com:7050): RUNNING
  a97644XXXXXXXXXXXXXXXdb074d4380f (hostname.com:7050): RUNNING [LEADER]
  401b6XXXXXXXXXXXXXXX5feda1de212b (hostname.com:7050): missing

Tablet 550XXXXXXXXXXXXXXX08f5fc94126927 of table 'impala::tbl1' is
under-replicated: 1 replica(s) not RUNNING
  aec55b4XXXXXXXXXXXXXXXdb469427cf (hostname.com:7050): RUNNING [LEADER]
  a7ca07f9b3d94XXXXXXXXXXXXXXX1cfb (hostname.com:7050): RUNNING
  31461XXXXXXXXXXXXXXX3dbe060807a6 (hostname.com:7050): bad state
    State:       NOT_STARTED
    Data state:  TABLET_DATA_READY
    Last status: Tablet initializing...

Tablet 4a1490fcXXXXXXXXXXXXXXX7a2c637e3 of table 'impala::tbl1' is
under-replicated: 1 replica(s) not RUNNING
  a7ca07f9b3d94414XXXXXXXXXXXXXXXb (hostname.com:7050): RUNNING
  40XXXXXXXXXXXXXXXd5b5feda1de212b (hostname.com:7050): RUNNING [LEADER]
  aec55b4e2acXXXXXXXXXXXXXXX9427cf (hostname.com:7050): bad state
    State:       NOT_STARTED
    Data state:  TABLET_DATA_COPYING
    Last status: TabletCopy: Downloading block 0000000005162382 (277/581)
...
...
==================
Errors:
==================
table consistency check error: Corruption: 52 table(s) are bad

FAILED
Runtime error: ksck discovered errors
```
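
For a report this long, a per-status tally can be easier to read than the raw listing. The following is only a minimal sketch: the function name `summarize_ksck` and both regular expressions are assumptions derived from the line shapes in the excerpt above, and other Kudu versions may format ksck output differently.

```python
import re
from collections import Counter

# Line shapes assumed from the ksck excerpt above (not an official format):
#   "  <uuid> (<host:port>): RUNNING [LEADER]"
#   "    Data state:  TABLET_DATA_COPYING"
REPLICA_RE = re.compile(r"^\s+(\S+) \(([^)]+)\): (.+?)\s*$")
DATA_STATE_RE = re.compile(r"^\s+Data state:\s+(\S+)")


def summarize_ksck(text):
    """Count replica statuses and the data states reported for bad replicas."""
    statuses = Counter()
    data_states = Counter()
    for line in text.splitlines():
        m = DATA_STATE_RE.match(line)
        if m:
            data_states[m.group(1)] += 1
            continue
        m = REPLICA_RE.match(line)
        if m:
            # Fold "RUNNING [LEADER]" into plain "RUNNING".
            statuses[m.group(3).replace(" [LEADER]", "")] += 1
    return statuses, data_states
```

Feeding the saved ksck output to `summarize_ksck` returns two counters: one keyed by replica status (`RUNNING`, `missing`, `bad state`) and one keyed by data state (for example `TABLET_DATA_COPYING`), which gives a rough view of how many replicas are still being copied.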



2017-04-13 3:47 GMT+09:00 Dan Burkert <danburkert@apache.org>:

> Hi Jason, answers inline:
>
> On Wed, Apr 12, 2017 at 5:53 AM, Jason Heo <jason.heo.sde@gmail.com>
> wrote:
>
>>
>> Q1. Can I disable redistributing tablets on failure of a tserver? The
>> reason why I need this is described in Background.
>>
>
> We don't have any kind of built-in maintenance mode that would prevent
> this, but it can be achieved by setting a flag on each of the tablet
> servers.  The goal is not to disable re-replicating tablets, but instead to
> avoid kicking the failed replica out of the tablet groups to begin with.
> There is a config flag to control exactly that: 'evict_failed_followers'.
> This isn't considered a stable or supported flag, but it should have the
> effect you are looking for, if you set it to false on each of the tablet
> servers, by running:
>
>     kudu tserver set-flag <tserver-addr> evict_failed_followers false --force
>
> for each tablet server.  When you are done, set it back to the default
> 'true' value.  This isn't something we routinely test (especially setting
> it without restarting the server), so please test before trying this on a
> production cluster.
>
>> Q2. Redistribution continues even after the failed tserver reconnects to
>> the cluster. In my test cluster, it took 2 hours to redistribute when a
>> tserver holding 3TB of data was killed.
>>
>
> This seems slow.  What's the speed of your network?  How many nodes?  How
> many tablet replicas were on the failed tserver, and were the replica sizes
> evenly balanced?  Next time this happens, you might try monitoring with
> 'kudu ksck' to ensure there aren't additional problems in the cluster (see
> the admin guide on the ksck tool:
> <https://github.com/apache/kudu/blob/master/docs/administration.adoc#ksck>).
>
>
>> Q3. Can `--follower_unavailable_considered_failed_sec` be changed
>> without restarting the cluster?
>>
>
> The flag can be changed, but it comes with the same caveats as above:
>
>     kudu tserver set-flag <tserver-addr> follower_unavailable_considered_failed_sec 900 --force
>
>
> - Dan
>
>
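
The `set-flag` commands quoted above have to be repeated once per tablet server. As a rough sketch of that loop (the helper names `set_flag_commands` and `apply_flag` and the example addresses are hypothetical; only the `kudu tserver set-flag ... --force` command line comes from the quoted reply), the commands can be generated and reviewed before anything is run:

```python
import subprocess


def set_flag_commands(tservers, flag, value):
    """Build one 'kudu tserver set-flag' argv list per tablet server address."""
    return [
        ["kudu", "tserver", "set-flag", addr, flag, value, "--force"]
        for addr in tservers
    ]


def apply_flag(tservers, flag, value, dry_run=True):
    """Print each command; run it only when dry_run is False."""
    for argv in set_flag_commands(tservers, flag, value):
        print(" ".join(argv))
        if not dry_run:
            # check=True stops at the first tablet server that rejects the flag.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Hypothetical addresses; replace with the real tablet servers.
    apply_flag(["ts1.example.com:7050", "ts2.example.com:7050"],
               "evict_failed_followers", "false")
```

The dry-run default is deliberate: as noted in the reply, these flags are not considered stable, so it seems safer to inspect the commands and try them on a test cluster first. The same helper can set the flag back to `true` afterwards.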
