kudu-user mailing list archives

From Dan Burkert <danburk...@apache.org>
Subject Re: Question about redistributing tablets on failure of a tserver.
Date Sat, 20 May 2017 15:02:12 GMT
Hey Jason,

What effect did you see with that patch applied?  I've had mixed results
with it in my failover tests - it hasn't resolved some of the issues that I
expected it would, so I'm still looking into it.  Any feedback you have on
it would be appreciated.

- Dan

On Fri, May 19, 2017 at 10:07 PM, Jason Heo <jason.heo.sde@gmail.com> wrote:

> Thanks, @dan @Todd
>
> This issue has been resolved via https://gerrit.cloudera.org/#/c/6925/
>
> Regards,
>
> Jason
>
> 2017-05-09 4:55 GMT+09:00 Todd Lipcon <todd@cloudera.com>:
>
>> Hey Jason
>>
>> Sorry for the delayed response here. It looks from your ksck like copying
>> is ongoing but hasn't yet finished.
>>
>> FWIW Will B is working on adding more informative output to ksck to help
>> diagnose cases like this:
>> https://gerrit.cloudera.org/#/c/6772/
>>
>> -Todd
>>
>> On Thu, Apr 13, 2017 at 11:35 PM, Jason Heo <jason.heo.sde@gmail.com>
>> wrote:
>>
>>> @Dan
>>>
>>> I monitored with `kudu ksck` while re-replication was occurring, but I'm
>>> not sure whether this output means my cluster has a problem. (It seems to
>>> indicate only that one tserver stopped.)
>>>
>>> Would you please check it?
>>>
>>> Thanks,
>>>
>>> Jason
>>>
>>> ```
>>> ...
>>> ...
>>> Tablet 0e29XXXXXXXXXXXXXXX1e1e3168a4d81 of table 'impala::tbl1' is
>>> under-replicated: 1 replica(s) not RUNNING
>>>   a7ca07f9bXXXXXXXXXXXXXXXbbb21cfb (hostname.com:7050): RUNNING
>>>   a97644XXXXXXXXXXXXXXXdb074d4380f (hostname.com:7050): RUNNING [LEADER]
>>>   401b6XXXXXXXXXXXXXXX5feda1de212b (hostname.com:7050): missing
>>>
>>> Tablet 550XXXXXXXXXXXXXXX08f5fc94126927 of table 'impala::tbl1' is
>>> under-replicated: 1 replica(s) not RUNNING
>>>   aec55b4XXXXXXXXXXXXXXXdb469427cf (hostname.com:7050): RUNNING [LEADER]
>>>   a7ca07f9b3d94XXXXXXXXXXXXXXX1cfb (hostname.com:7050): RUNNING
>>>   31461XXXXXXXXXXXXXXX3dbe060807a6 (hostname.com:7050): bad state
>>>     State:       NOT_STARTED
>>>     Data state:  TABLET_DATA_READY
>>>     Last status: Tablet initializing...
>>>
>>> Tablet 4a1490fcXXXXXXXXXXXXXXX7a2c637e3 of table 'impala::tbl1' is
>>> under-replicated: 1 replica(s) not RUNNING
>>>   a7ca07f9b3d94414XXXXXXXXXXXXXXXb (hostname.com:7050): RUNNING
>>>   40XXXXXXXXXXXXXXXd5b5feda1de212b (hostname.com:7050): RUNNING [LEADER]
>>>   aec55b4e2acXXXXXXXXXXXXXXX9427cf (hostname.com:7050): bad state
>>>     State:       NOT_STARTED
>>>     Data state:  TABLET_DATA_COPYING
>>>     Last status: TabletCopy: Downloading block 0000000005162382 (277/581)
>>> ...
>>> ...
>>> ==================
>>> Errors:
>>> ==================
>>> table consistency check error: Corruption: 52 table(s) are bad
>>>
>>> FAILED
>>> Runtime error: ksck discovered errors
>>> ```
>>>
>>>
>>>
>>> 2017-04-13 3:47 GMT+09:00 Dan Burkert <danburkert@apache.org>:
>>>
>>>> Hi Jason, answers inline:
>>>>
>>>> On Wed, Apr 12, 2017 at 5:53 AM, Jason Heo <jason.heo.sde@gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>> Q1. Can I disable redistributing tablets on failure of a tserver? The
>>>>> reason why I need this is described in Background.
>>>>>
>>>>
>>>> We don't have any kind of built-in maintenance mode that would prevent
>>>> this, but it can be achieved by setting a flag on each of the tablet
>>>> servers.  The goal is not to disable re-replicating tablets, but instead to
>>>> avoid kicking the failed replica out of the tablet groups to begin with.
>>>> There is a config flag to control exactly that: 'evict_failed_followers'.
>>>> This isn't considered a stable or supported flag, but it should have the
>>>> effect you are looking for, if you set it to false on each of the tablet
>>>> servers, by running:
>>>>
>>>>     kudu tserver set-flag <tserver-addr> evict_failed_followers false --force
>>>>
>>>> for each tablet server.  When you are done, set it back to the default
>>>> 'true' value.  This isn't something we routinely test (especially setting
>>>> it without restarting the server), so please test before trying this on a
>>>> production cluster.
>>>>
>>>>> Q2. Redistribution goes on even if the failed tserver reconnects to the
>>>>> cluster. In my test cluster, it took 2 hours to redistribute when a
>>>>> tserver holding 3TB of data was killed.
>>>>>
>>>>
>>>> This seems slow.  What's the speed of your network?  How many nodes?
>>>> How many tablet replicas were on the failed tserver, and were the replica
>>>> sizes evenly balanced?  Next time this happens, you might try monitoring
>>>> with 'kudu ksck' to ensure there aren't additional problems in the cluster
>>>> (see the admin guide on the ksck tool:
>>>> <https://github.com/apache/kudu/blob/master/docs/administration.adoc#ksck>).
>>>>
>>>>
>>>>> Q3. Can `--follower_unavailable_considered_failed_sec` be changed
>>>>> without restarting the cluster?
>>>>>
>>>>
>>>> The flag can be changed, but it comes with the same caveats as above:
>>>>
>>>>     kudu tserver set-flag <tserver-addr> follower_unavailable_considered_failed_sec 900 --force
>>>>
>>>>
>>>> - Dan
>>>>
>>>>
>>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>
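For anyone following the suggestion above to set 'evict_failed_followers' on
each tablet server, the per-server command could be wrapped in a loop along
these lines. This is only a sketch: the host names are hypothetical
placeholders, and it prints the commands by default (DRY_RUN=1) instead of
running them. The caveats from the thread still apply: the flag is not
considered stable or supported, so test before using it on a production
cluster.

```shell
#!/usr/bin/env bash
# Sketch: apply one flag to every tablet server with the command from the thread.
# The addresses below are hypothetical placeholders, not real hosts.

# Usage: set_flag_on_all <flag> <value> <tserver-addr>...
# With DRY_RUN=1 (the default) the commands are printed, not executed.
set_flag_on_all() {
  local flag="$1" value="$2"
  shift 2
  local addr
  for addr in "$@"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "kudu tserver set-flag $addr $flag $value --force"
    else
      kudu tserver set-flag "$addr" "$flag" "$value" --force
    fi
  done
}

# Before planned maintenance: keep failed followers in their tablet groups.
set_flag_on_all evict_failed_followers false \
  tserver1.example.com:7050 tserver2.example.com:7050
```

When the maintenance is over, the same call with 'true' restores the default
value mentioned in the thread.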

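The ksck output quoted in the thread can get long (52 tables were flagged), so
a small filter that counts under-replicated tablets and tallies the
'Data state' values may help when watching re-replication progress. A rough
sketch, assuming the line shapes shown in Jason's sample and that the
"Tablet ... is under-replicated" header sits on one line in the raw output
(the wrap in the thread looks like an email artifact). Other ksck versions may
print differently.

```shell
#!/usr/bin/env bash
# Sketch: summarize ksck text output read from stdin.
# Assumes the line shapes from the sample quoted in the thread.

summarize_ksck() {
  awk '
    /under-replicated:/ { under++ }          # one match per affected tablet
    /Data state:/       { state[$NF]++ }     # e.g. TABLET_DATA_COPYING
    END {
      printf "under-replicated tablets: %d\n", under
      for (s in state) printf "%s: %d\n", s, state[s]
    }
  '
}

# Sample lines copied from the output quoted in the thread.
summarize_ksck <<'EOF'
Tablet 4a1490fcXXXXXXXXXXXXXXX7a2c637e3 of table 'impala::tbl1' is under-replicated: 1 replica(s) not RUNNING
  aec55b4e2acXXXXXXXXXXXXXXX9427cf (hostname.com:7050): bad state
    State:       NOT_STARTED
    Data state:  TABLET_DATA_COPYING
    Last status: TabletCopy: Downloading block 0000000005162382 (277/581)
EOF
```

Saved ksck output can be piped into summarize_ksck in the same way. Across
repeated runs, a falling TABLET_DATA_COPYING count would suggest that tablet
copies are completing.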