kudu-commits mailing list archives

From a...@apache.org
Subject [kudu] 02/02: KUDU-2748 Leader master erroneously tries to tablet copy to a follower master due to race at startup
Date Thu, 21 Mar 2019 01:05:17 GMT
This is an automated email from the ASF dual-hosted git repository.

adar pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/kudu.git

commit 28c706722891d20aada5d8bee4cfafe456c89561
Author: Will Berkeley <wdberkeley@gmail.com>
AuthorDate: Fri Mar 15 14:38:38 2019 -0700

    KUDU-2748 Leader master erroneously tries to tablet copy to a follower master due to race at startup

    The initialization of the master works as follows:
    1. Register RPC services.
    2. Init catalog manager asynchronously.

    As a result, if a master in a multimaster cluster with a healthy leader
    starts, there is a brief period of time when a call to UpdateConsensus
    from the leader master will hit a CatalogManager and SysTable that are
    not initialized. The initializing master will respond TABLET_NOT_FOUND
    to the leader, which will cause the leader master to initiate the tablet
    copy process. This is a dead end because masters don't support tablet
    copy. Things are stuck until there is a leadership change or the
    "orphaned" master is restarted again.

    Tablets on tablet servers are not vulnerable to this because their
    startup order is
    1. Init the ts tablet manager synchronously.
    2. Register RPC services.
    So it is not possible for an UpdateConsensus call to query a ts tablet
    manager that hasn't loaded all of the initial tablets.
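
The two startup orders above can be contrasted with a small model. This is only a sketch: ToyServer, Observed, and the Probe* functions are hypothetical names invented for illustration and do not exist in the Kudu source. It assumes an UpdateConsensus call can reach a server only once its RPC services are registered.

```cpp
// Hypothetical model of the two startup orders; not Kudu code.
enum class Observed {
  kUnreachable,     // RPC services not registered, so the call never arrives.
  kNotInitialized,  // Call arrives, but tablet state is not loaded yet.
  kOk               // Call arrives and tablet state is loaded.
};

class ToyServer {
 public:
  void RegisterRpcServices() { rpc_registered_ = true; }
  void InitTabletState() { initialized_ = true; }

  // What an incoming UpdateConsensus call would observe right now.
  Observed HandleUpdateConsensus() const {
    if (!rpc_registered_) return Observed::kUnreachable;
    return initialized_ ? Observed::kOk : Observed::kNotInitialized;
  }

 private:
  bool rpc_registered_ = false;
  bool initialized_ = false;
};

// Master order: register RPC services, then init. A call that lands
// between the two steps sees uninitialized state.
inline Observed ProbeMasterOrder() {
  ToyServer s;
  s.RegisterRpcServices();
  Observed seen = s.HandleUpdateConsensus();
  s.InitTabletState();
  return seen;
}

// Tablet server order: init, then register RPC services. A call made
// between the two steps cannot reach the server at all.
inline Observed ProbeTabletServerOrder() {
  ToyServer s;
  s.InitTabletState();
  Observed seen = s.HandleUpdateConsensus();
  s.RegisterRpcServices();
  return seen;
}
```

In this model only the master order can ever yield kNotInitialized, which corresponds to the window described above.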
    The fix is pretty simple: recognize the ServiceUnavailable status
    returned by the tablet lookup for the master tablet and respond with
    UNKNOWN_ERROR instead of TABLET_NOT_FOUND. This will cause the leader
    master to retry until the initializing master has finished
    initializing.
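
The effect of the fix on the leader can be sketched as follows. All names here (LookupStatus, ErrorCode, LeaderAction, MapLookupStatus, ReactToError, AttemptsUntilAccepted) are hypothetical, and the leader's reaction is a simplification based only on the behavior described in this message: TABLET_NOT_FOUND leads to a tablet copy attempt, while any other error is retried.

```cpp
// Hypothetical model of the error mapping and the leader's reaction;
// not Kudu code.
enum class LookupStatus { kOk, kNotFound, kServiceUnavailable };
enum class ErrorCode { kNone, kUnknownError, kTabletNotFound };
enum class LeaderAction { kProceed, kRetryLater, kStartTabletCopy };

// Mapping with the fix: ServiceUnavailable becomes UNKNOWN_ERROR and
// every other lookup failure stays TABLET_NOT_FOUND.
inline ErrorCode MapLookupStatus(LookupStatus s) {
  if (s == LookupStatus::kOk) return ErrorCode::kNone;
  if (s == LookupStatus::kServiceUnavailable) return ErrorCode::kUnknownError;
  return ErrorCode::kTabletNotFound;
}

// Simplified leader behavior, assumed from the description above.
inline LeaderAction ReactToError(ErrorCode e) {
  switch (e) {
    case ErrorCode::kNone: return LeaderAction::kProceed;
    case ErrorCode::kTabletNotFound: return LeaderAction::kStartTabletCopy;
    case ErrorCode::kUnknownError: return LeaderAction::kRetryLater;
  }
  return LeaderAction::kRetryLater;
}

// Number of attempts until the follower accepts a call, or -1 if the
// leader starts a tablet copy first. The follower is uninitialized for
// the first `calls_before_init` attempts.
inline int AttemptsUntilAccepted(int calls_before_init, bool with_fix) {
  for (int attempt = 1;; ++attempt) {
    LookupStatus s = attempt > calls_before_init
                         ? LookupStatus::kOk
                         : LookupStatus::kServiceUnavailable;
    // Without the fix, every lookup failure is reported as TABLET_NOT_FOUND.
    ErrorCode e = with_fix ? MapLookupStatus(s)
                           : (s == LookupStatus::kOk
                                  ? ErrorCode::kNone
                                  : ErrorCode::kTabletNotFound);
    switch (ReactToError(e)) {
      case LeaderAction::kProceed: return attempt;
      case LeaderAction::kStartTabletCopy: return -1;
      case LeaderAction::kRetryLater: break;  // Loop and try again.
    }
  }
}
```

With a follower that needs three calls' worth of time to initialize, the model with the fix succeeds on the fourth attempt, while the model without it ends in a tablet copy on the first.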
    This was the cause of flakiness in KUDU-2734. Without the fix, about 8%
    of runs fail on TSAN with 8 stress threads. With the fix, about 0.3% do
    (and in 2000 runs with 6 failures I verified that none of the 6 were due
    to this issue).

    Change-Id: Ib86548085e45ed5cd987d99e227a1af84bf801e7
    Reviewed-on: http://gerrit.cloudera.org:8080/12770
    Tested-by: Kudu Jenkins
    Reviewed-by: Adar Dembo <adar@cloudera.com>
    Reviewed-by: Grant Henke <granthenke@apache.org>
    Reviewed-by: Alexey Serbin <aserbin@cloudera.com>
 src/kudu/tserver/tablet_service.cc | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/src/kudu/tserver/tablet_service.cc b/src/kudu/tserver/tablet_service.cc
index 5d89ea1..010b35b 100644
--- a/src/kudu/tserver/tablet_service.cc
+++ b/src/kudu/tserver/tablet_service.cc
@@ -230,8 +230,15 @@ bool LookupTabletReplicaOrRespond(TabletReplicaLookupIf* tablet_manager,
                                   scoped_refptr<TabletReplica>* replica) {
   Status s = tablet_manager->GetTabletReplica(tablet_id, replica);
   if (PREDICT_FALSE(!s.ok())) {
-    SetupErrorAndRespond(resp->mutable_error(), s,
-                         TabletServerErrorPB::TABLET_NOT_FOUND, context);
+    if (s.IsServiceUnavailable()) {
+      // If the tablet manager isn't initialized, the remote should check again
+      // soon.
+      SetupErrorAndRespond(resp->mutable_error(), s,
+                           TabletServerErrorPB::UNKNOWN_ERROR, context);
+    } else {
+      SetupErrorAndRespond(resp->mutable_error(), s,
+                           TabletServerErrorPB::TABLET_NOT_FOUND, context);
+    }
     return false;
   }
   return true;
