This is an automated email from the ASF dual-hosted git repository.
adar pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/kudu.git
commit 28c706722891d20aada5d8bee4cfafe456c89561
Author: Will Berkeley <wdberkeley@gmail.com>
AuthorDate: Fri Mar 15 14:38:38 2019 -0700
KUDU-2748 Leader master erroneously tries to tablet copy to a follower master due to race at startup
The initialization of the master works as follows:
1. Register RPC services.
2. Init catalog manager asynchronously.
As a result, if a master in a multimaster cluster with a healthy leader
starts, there is a brief period of time when a call to UpdateConsensus
from the leader master will hit a CatalogManager and SysTable that are
not initialized. The initializing master will respond TABLET_NOT_FOUND
to the leader, which will cause the leader master to initiate the tablet
copy process. This is a dead end because masters don't support tablet
copy. Things are stuck until there is a leadership change or the
"orphaned" master is restarted again.
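The window described above can be reproduced in a toy model. The sketch
below is illustrative only: ToyMaster, its methods, and the ErrorCode enum
are hypothetical stand-ins rather than Kudu classes, and the asynchronous
catalog manager init is replaced by an explicit call so the example stays
deterministic.

```cpp
#include <cassert>

// Hypothetical error codes covering only the two outcomes discussed above.
enum class ErrorCode { kOk, kTabletNotFound };

// Toy stand-in for a master process; not Kudu's real class.
class ToyMaster {
 public:
  // Step 1: RPC services become reachable.
  void RegisterRpcServices() { rpc_registered_ = true; }

  // Step 2: the catalog manager finishes loading. In the real system this
  // happens asynchronously; here it is an explicit call.
  void FinishCatalogManagerInit() { catalog_initialized_ = true; }

  // Pre-fix behavior: any failed lookup is reported as "tablet not found".
  ErrorCode HandleUpdateConsensus() const {
    assert(rpc_registered_);  // A call can only arrive once RPCs are registered.
    return catalog_initialized_ ? ErrorCode::kOk : ErrorCode::kTabletNotFound;
  }

 private:
  bool rpc_registered_ = false;
  bool catalog_initialized_ = false;
};

// What the leader sees if its call lands between step 1 and step 2.
inline ErrorCode CallDuringStartupWindow() {
  ToyMaster m;
  m.RegisterRpcServices();
  return m.HandleUpdateConsensus();  // Catalog manager is not initialized yet.
}

// What the leader sees once both steps have completed.
inline ErrorCode CallAfterStartup() {
  ToyMaster m;
  m.RegisterRpcServices();
  m.FinishCatalogManagerInit();
  return m.HandleUpdateConsensus();
}
```

In this model, CallDuringStartupWindow() yields kTabletNotFound even though
the tablet would be found a moment later, which is the misleading answer
that sends the leader down the tablet copy path.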
Tablets on tablet servers are not vulnerable to this because the tablet
server startup order is:
1. Init the ts tablet manager synchronously.
2. Register RPC services.
So it is not possible for an UpdateConsensus call to query a ts tablet
manager that hasn't loaded all of the initial tablets.
The fix is pretty simple: recognize the ServiceUnavailable status
returned by the tablet lookup for the master tablet and respond with it
under UNKNOWN_ERROR instead of TABLET_NOT_FOUND. This will cause the
leader master to retry until the initializing master has finished
initializing.
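A rough sketch of why the changed error code matters to the leader is
given here. Everything in it is a simplified assumption for illustration:
the enums, ErrorCodeForLookup, ReactTo, and AttemptsUntilAccepted are
hypothetical names, and the leader's reaction is reduced to the two
behaviors this message describes (tablet copy on TABLET_NOT_FOUND, retry
otherwise).

```cpp
#include <cassert>

// Hypothetical, simplified stand-ins for the statuses and codes involved.
enum class LookupStatus { kOk, kNotFound, kServiceUnavailable };
enum class ErrorCode { kNone, kUnknownError, kTabletNotFound };
enum class LeaderAction { kProceed, kRetryLater, kStartTabletCopy };

// Post-fix mapping: an uninitialized tablet manager is no longer reported
// as a missing tablet.
inline ErrorCode ErrorCodeForLookup(LookupStatus s) {
  switch (s) {
    case LookupStatus::kOk:                 return ErrorCode::kNone;
    case LookupStatus::kServiceUnavailable: return ErrorCode::kUnknownError;
    case LookupStatus::kNotFound:           return ErrorCode::kTabletNotFound;
  }
  return ErrorCode::kUnknownError;
}

// Assumed leader reaction, reduced to the behavior described in this message.
inline LeaderAction ReactTo(ErrorCode code) {
  switch (code) {
    case ErrorCode::kNone:           return LeaderAction::kProceed;
    case ErrorCode::kTabletNotFound: return LeaderAction::kStartTabletCopy;
    case ErrorCode::kUnknownError:   return LeaderAction::kRetryLater;
  }
  return LeaderAction::kRetryLater;
}

// Counts the calls the leader makes before the follower accepts one, given
// that the follower is still initializing during the first
// `calls_before_ready` calls. Returns -1 if the leader starts a tablet copy.
inline int AttemptsUntilAccepted(int calls_before_ready, bool with_fix) {
  for (int attempt = 1; attempt <= calls_before_ready + 1; ++attempt) {
    const bool ready = attempt > calls_before_ready;
    const LookupStatus s =
        ready ? LookupStatus::kOk : LookupStatus::kServiceUnavailable;
    // Without the fix, every lookup failure is reported as TABLET_NOT_FOUND.
    const ErrorCode code =
        with_fix ? ErrorCodeForLookup(s)
                 : (ready ? ErrorCode::kNone : ErrorCode::kTabletNotFound);
    switch (ReactTo(code)) {
      case LeaderAction::kProceed:         return attempt;
      case LeaderAction::kStartTabletCopy: return -1;
      case LeaderAction::kRetryLater:      break;  // Try again.
    }
  }
  return -1;
}
```

Under these assumptions, a follower that needs three calls' worth of time
to initialize is accepted on the fourth call with the fix, while without
it the very first call already triggers the dead-end tablet copy.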
This was the cause of flakiness in KUDU-2734. Without the fix, about 8%
of runs fail on TSAN with 8 stress threads. With the fix, about 0.3% do
(and in 2000 runs with 6 failures I verified that none of the 6 were due
to this issue).
Change-Id: Ib86548085e45ed5cd987d99e227a1af84bf801e7
Reviewed-on: http://gerrit.cloudera.org:8080/12770
Tested-by: Kudu Jenkins
Reviewed-by: Adar Dembo <adar@cloudera.com>
Reviewed-by: Grant Henke <granthenke@apache.org>
Reviewed-by: Alexey Serbin <aserbin@cloudera.com>
---
src/kudu/tserver/tablet_service.cc | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/src/kudu/tserver/tablet_service.cc b/src/kudu/tserver/tablet_service.cc
index 5d89ea1..010b35b 100644
--- a/src/kudu/tserver/tablet_service.cc
+++ b/src/kudu/tserver/tablet_service.cc
@@ -230,8 +230,15 @@ bool LookupTabletReplicaOrRespond(TabletReplicaLookupIf* tablet_manager,
                                   scoped_refptr<TabletReplica>* replica) {
   Status s = tablet_manager->GetTabletReplica(tablet_id, replica);
   if (PREDICT_FALSE(!s.ok())) {
-    SetupErrorAndRespond(resp->mutable_error(), s,
-                         TabletServerErrorPB::TABLET_NOT_FOUND, context);
+    if (s.IsServiceUnavailable()) {
+      // If the tablet manager isn't initialized, the remote should check again
+      // soon.
+      SetupErrorAndRespond(resp->mutable_error(), s,
+                           TabletServerErrorPB::UNKNOWN_ERROR, context);
+    } else {
+      SetupErrorAndRespond(resp->mutable_error(), s,
+                           TabletServerErrorPB::TABLET_NOT_FOUND, context);
+    }
     return false;
   }
   return true;