Subject: Re: Question about redistributing tablets on failure of a tserver.
From: Dan Burkert
To: user@kudu.apache.org
Date: Sat, 20 May 2017 08:02:12 -0700

Hey Jason,

What effect did you see with that patch applied? I've had mixed results with it in my failover tests - it hasn't resolved some of the issues I expected it to, so I'm still looking into it. Any feedback you have on it would be appreciated.

- Dan

On Fri, May 19, 2017 at 10:07 PM, Jason Heo wrote:

> Thanks, @dan @Todd
>
> This issue has been resolved via https://gerrit.cloudera.org/#/c/6925/
>
> Regards,
>
> Jason
>
> 2017-05-09 4:55 GMT+09:00 Todd Lipcon :
>
>> Hey Jason,
>>
>> Sorry for the delayed response here. It looks from your ksck output like
>> the copying is ongoing but hasn't finished yet.
>>
>> FWIW, Will B is working on adding more informative output to ksck to help
>> diagnose cases like this:
>> https://gerrit.cloudera.org/#/c/6772/
>>
>> -Todd
>>
>> On Thu, Apr 13, 2017 at 11:35 PM, Jason Heo wrote:
>>
>>> @Dan
>>>
>>> I monitored with `kudu ksck` while the re-replication was occurring, but
>>> I'm not sure whether this output means my cluster has a problem. (It seems
>>> to just indicate that one tserver stopped.)
>>>
>>> Would you please check it?
>>>
>>> Thanks,
>>>
>>> Jason
>>>
>>> ```
>>> ...
>>> ...
>>> Tablet 0e29XXXXXXXXXXXXXXX1e1e3168a4d81 of table 'impala::tbl1' is
>>> under-replicated: 1 replica(s) not RUNNING
>>>   a7ca07f9bXXXXXXXXXXXXXXXbbb21cfb (hostname.com:7050): RUNNING
>>>   a97644XXXXXXXXXXXXXXXdb074d4380f (hostname.com:7050): RUNNING [LEADER]
>>>   401b6XXXXXXXXXXXXXXX5feda1de212b (hostname.com:7050): missing
>>>
>>> Tablet 550XXXXXXXXXXXXXXX08f5fc94126927 of table 'impala::tbl1' is
>>> under-replicated: 1 replica(s) not RUNNING
>>>   aec55b4XXXXXXXXXXXXXXXdb469427cf (hostname.com:7050): RUNNING [LEADER]
>>>   a7ca07f9b3d94XXXXXXXXXXXXXXX1cfb (hostname.com:7050): RUNNING
>>>   31461XXXXXXXXXXXXXXX3dbe060807a6 (hostname.com:7050): bad state
>>>     State:       NOT_STARTED
>>>     Data state:  TABLET_DATA_READY
>>>     Last status: Tablet initializing...
>>>
>>> Tablet 4a1490fcXXXXXXXXXXXXXXX7a2c637e3 of table 'impala::tbl1' is
>>> under-replicated: 1 replica(s) not RUNNING
>>>   a7ca07f9b3d94414XXXXXXXXXXXXXXXb (hostname.com:7050): RUNNING
>>>   40XXXXXXXXXXXXXXXd5b5feda1de212b (hostname.com:7050): RUNNING [LEADER]
>>>   aec55b4e2acXXXXXXXXXXXXXXX9427cf (hostname.com:7050): bad state
>>>     State:       NOT_STARTED
>>>     Data state:  TABLET_DATA_COPYING
>>>     Last status: TabletCopy: Downloading block 0000000005162382 (277/581)
>>> ...
>>> ...
>>> ==================
>>> Errors:
>>> ==================
>>> table consistency check error: Corruption: 52 table(s) are bad
>>>
>>> FAILED
>>> Runtime error: ksck discovered errors
>>> ```
>>>
>>> 2017-04-13 3:47 GMT+09:00 Dan Burkert :
>>>
>>>> Hi Jason, answers inline:
>>>>
>>>> On Wed, Apr 12, 2017 at 5:53 AM, Jason Heo wrote:
>>>>
>>>>> Q1. Can I disable redistributing tablets on failure of a tserver? The
>>>>> reason why I need this is described in Background.
>>>>
>>>> We don't have any kind of built-in maintenance mode that would prevent
>>>> this, but it can be achieved by setting a flag on each of the tablet
>>>> servers. The goal is not to disable re-replicating tablets, but instead
>>>> to avoid kicking the failed replica out of the tablet groups to begin
>>>> with. There is a config flag to control exactly that:
>>>> 'evict_failed_followers'. This isn't considered a stable or supported
>>>> flag, but it should have the effect you are looking for if you set it to
>>>> false on each of the tablet servers, by running:
>>>>
>>>>     kudu tserver set-flag <tserver-addr> evict_failed_followers false --force
>>>>
>>>> for each tablet server. When you are done, set it back to the default
>>>> 'true' value. This isn't something we routinely test (especially setting
>>>> it without restarting the server), so please test before trying this on
>>>> a production cluster.
>>>>
>>>>> Q2. Redistribution continues even after the failed tserver reconnects to
>>>>> the cluster. In my test cluster, it took 2 hours to redistribute when a
>>>>> tserver holding 3TB of data was killed.
>>>>
>>>> This seems slow. What's the speed of your network? How many nodes? How
>>>> many tablet replicas were on the failed tserver, and were the replica
>>>> sizes evenly balanced? Next time this happens, you might try monitoring
>>>> with 'kudu ksck' to ensure there aren't additional problems in the
>>>> cluster (see the admin guide on the ksck tool).
>>>>
>>>>> Q3. Can `--follower_unavailable_considered_failed_sec` be changed
>>>>> without restarting the cluster?
>>>>
>>>> The flag can be changed, but it comes with the same caveats as above:
>>>>
>>>>     kudu tserver set-flag <tserver-addr> follower_unavailable_considered_failed_sec 900 --force
>>>>
>>>> - Dan
>>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
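
Putting the commands from the thread together, here is a minimal sketch of the maintenance workflow Dan describes. The hostnames and addresses below are hypothetical placeholders, and as noted above these are unsupported flags, so verify the behavior on a test cluster first.

```sh
#!/bin/sh
# Hypothetical tablet server and master addresses; substitute your own.
TSERVERS="ts1.example.com:7050 ts2.example.com:7050 ts3.example.com:7050"
MASTER="master1.example.com:7051"

# Before taking a tserver down, stop failed followers from being evicted
# (unsupported flag; per the thread, test on a non-production cluster first).
for ts in $TSERVERS; do
  kudu tserver set-flag "$ts" evict_failed_followers false --force
done

# Alternatively, lengthen the failure timeout instead (same caveats apply):
#   kudu tserver set-flag "$ts" follower_unavailable_considered_failed_sec 900 --force

# ... perform maintenance and bring the tserver back ...

# Check cluster health; under-replicated tablets should become healthy again
# once the restarted tserver's replicas catch up.
kudu cluster ksck "$MASTER"

# Restore the default once maintenance is complete.
for ts in $TSERVERS; do
  kudu tserver set-flag "$ts" evict_failed_followers true --force
done
```

Note that flags changed with `set-flag` are not persisted, so a tablet server that restarts will come back with the default value.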