kudu-commits mailing list archives

From mpe...@apache.org
Subject [kudu] 02/03: client: avoid accessing state after scheduling retry
Date Thu, 17 Jan 2019 19:06:53 GMT
This is an automated email from the ASF dual-hosted git repository.

mpercy pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/kudu.git

commit 3450c78d2990dfc65c38e47661901fc68fdf8ced
Author: Andrew Wong <awong@cloudera.com>
AuthorDate: Sat Jan 5 22:22:19 2019 -0800

    client: avoid accessing state after scheduling retry
    
    I saw a failure of MasterMigrationTest.TestEndToEndMigration in which
    the stack watchdog logged the following stacktrace:
    
    User stack:
        @     0x7fa3e575c330  (unknown) at ??:0
        @           0x52dc09  __sanitizer::internal_read() at /data/8/awong/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/sanitizer_common/sanitizer_syscall_linux_x86_64.inc:46 (discriminator 7)
        @           0x52f79f  __sanitizer::ReadFromFile() at /data/8/awong/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/sanitizer_common/sanitizer_posix.cc:176
        @           0x53acb9  __sanitizer::SymbolizerProcess::ReadFromSymbolizer() at /data/8/awong/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/sanitizer_common/sanitizer_symbolizer_libcdep.cc:468
        @           0x53ba76  __sanitizer::SymbolizerProcess::SendCommand() at /data/8/awong/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/sanitizer_common/sanitizer_symbolizer_libcdep.cc:445
        @           0x53c3c5  __sanitizer::Symbolizer::SymbolizePC() at /data/8/awong/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/sanitizer_common/sanitizer_symbolizer_libcdep.cc:356
        @           0x539673  __sanitizer::StackTrace::Print() at /data/8/awong/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/sanitizer_common/sanitizer_stacktrace_libcdep.cc:36
        @           0x541c44  MaybePrintStackTrace() at /data/8/awong/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/ubsan/ubsan_diag.cc:48
        @           0x5422ee  __ubsan::ScopedReport::~ScopedReport() at /data/8/awong/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/ubsan/ubsan_diag.cc:73
        @           0x549608  HandleDynamicTypeCacheMiss() at /data/8/awong/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/ubsan/ubsan_handlers_cxx.cc:81
        @           0x549a43  __ubsan_handle_dynamic_type_cache_miss_abort at /data/8/awong/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/ubsan/ubsan_handlers_cxx.cc:93
        @     0x7fa3f0086643  _ZZN4kudu6client8internal20AsyncLeaderMasterRpcINS_6master23GetTableSchemaRequestPBENS3_24GetTableSchemaResponsePBEE27RetryOrReconnectIfNecessaryEPNS_6StatusEENKUlvE_clEv at ??:0
        @     0x7fa3f0059b8c  _ZN4kudu13ScopedCleanupIZNS_6client8internal20AsyncLeaderMasterRpcINS_6master23GetTableSchemaRequestPBENS4_24GetTableSchemaResponsePBEE27RetryOrReconnectIfNecessaryEPNS_6StatusEEUlvE_ED2Ev at ??:0
        @     0x7fa3f005945c  kudu::client::internal::AsyncLeaderMasterRpc<>::RetryOrReconnectIfNecessary() at ??:0
        @     0x7fa3f0057d0a  kudu::client::internal::AsyncLeaderMasterRpc<>::SendRpcCb() at ??:0
        @     0x7fa3f0085a85  _ZNSt5_BindIFSt7_Mem_fnIMN4kudu6client8internal20AsyncLeaderMasterRpcINS1_6master23GetTableSchemaRequestPBENS5_24GetTableSchemaResponsePBEEEFvRKNS1_6StatusEEEPS8_S9_EE6__callIvJEJLm0ELm1EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE at ??:0
        #19 0x7fa3ddc84ffc in clone sysdeps/unix/sysv/linux/x86_64/clone.S:111
    
    Upon looking into this and similar failures, it seems that the following
    series of events is possible (a simplified code sketch follows the
    sequence below):
    
    T1: allocate an AsyncLeaderMasterRpc on the stack with
        Synchronizer.Wait() as the user-specified callback
    T2: the RPC call results in an error that makes it retry via
        RetryOrReconnectIfNecessary(); a retry is scheduled
    T3: the retry completes and calls the user-specified callback to begin
        execution on T1
    T1: with the RPC call completed, destroys RPC object
    T2: the KLOG messages attempt to access state from the destroyed RPC
        object and hit a crash, race, undefined behavior, etc.
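
    To make that window concrete, here is a minimal C++17 sketch of the
    pre-patch pattern. FakeRpc, ScopeExit, and schedule_retry are simplified
    stand-ins invented for illustration; they are not Kudu's actual
    AsyncLeaderMasterRpc, ScopedCleanup, or retry machinery.

        #include <functional>
        #include <iostream>
        #include <string>
        #include <utility>

        // Stand-in for a scoped-cleanup helper: runs a functor at scope exit.
        template <typename F>
        class ScopeExit {
         public:
          explicit ScopeExit(F f) : f_(std::move(f)) {}
          ~ScopeExit() { f_(); }
         private:
          F f_;
        };

        struct FakeRpc {
          std::string rpc_name = "GetTableSchema";

          // Pre-patch pattern: the cleanup lambda captures 'this' via [&], so
          // 'rpc_name' is read only when the cleanup fires -- after the retry
          // has been scheduled, and possibly after it completed, woke the
          // owner (T1), and the owner destroyed this object.
          void RetryOrReconnectUnsafe(const std::function<void()>& schedule_retry) {
            ScopeExit warn_on_retry([&] {
              std::cout << "Re-attempting " << rpc_name << " request\n";
            });
            schedule_retry();  // T2: the retry may complete on another thread
          }                    // ~ScopeExit fires here and may read destroyed state
        };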
    
    The fix is to make a copy of the RPC's state that was to be logged so
    there's no chance that T1's destruction of the RPC will affect T2. I
    tested this by looping a test[1] that repeatedly called
    Client::IsCreateTableInProgress() on a multi-master cluster configured
    to have election storms: before this patch, 5/1000 runs yielded TSAN
    errors around the logging calls in RetryOrReconnectIfNecessary; with
    this patch, 1000/1000 runs pass.
    
    [1] https://gist.github.com/andrwng/5d552f75a2e0d671b7ed54dd01892c66
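
    Continuing the stand-ins from the earlier sketch (still illustrative, not
    Kudu's real API), the fixed pattern mirrors the hunk below: build the
    message eagerly, while the members are guaranteed to be alive, and let the
    cleanup capture only that local copy.

        // Would sit alongside RetryOrReconnectUnsafe in the FakeRpc sketch above.
        void RetryOrReconnectFixed(const std::function<void()>& schedule_retry) {
          // The reference capture is safe: 'retry_warning' lives in this stack
          // frame, which outlives the scoped cleanup; no member is touched once
          // the retry has been scheduled.
          const std::string retry_warning =
              "Re-attempting " + rpc_name + " request";
          ScopeExit warn_on_retry([&retry_warning] {
            std::cout << retry_warning << "\n";  // no access to 'this' here
          });
          schedule_retry();  // even if the retry completes and the owner
                             // destroys this object, the deferred log above
                             // reads only local state
        }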
    
    Change-Id: I8cefd9613018247a1a25d17adedc021e8be166f6
    Reviewed-on: http://gerrit.cloudera.org:8080/12170
    Tested-by: Kudu Jenkins
    Reviewed-by: Adar Dembo <adar@cloudera.com>
    Reviewed-by: Alexey Serbin <aserbin@cloudera.com>
---
 src/kudu/client/master_proxy_rpc.cc | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/src/kudu/client/master_proxy_rpc.cc b/src/kudu/client/master_proxy_rpc.cc
index 393a681..326df12 100644
--- a/src/kudu/client/master_proxy_rpc.cc
+++ b/src/kudu/client/master_proxy_rpc.cc
@@ -182,10 +182,14 @@ void AsyncLeaderMasterRpc<ReqClass, RespClass>::ResetMasterLeaderAndRetry(
 template <class ReqClass, class RespClass>
 bool AsyncLeaderMasterRpc<ReqClass, RespClass>::RetryOrReconnectIfNecessary(
     Status* status) {
-  auto warn_on_retry = MakeScopedCleanup([&] {
-    KLOG_EVERY_N_SECS(WARNING, 1) <<
-        Substitute("Re-attempting $0 request to leader Master ($1)",
-            rpc_name_, client_->data_->leader_master_hostport().ToString());
+  const string retry_warning = Substitute("Re-attempting $0 request to leader Master ($1)",
+      rpc_name_, client_->data_->leader_master_hostport().ToString());
+  auto warn_on_retry = MakeScopedCleanup([&retry_warning] {
+    // NOTE: we pass in a ref to 'retry_warning' rather than evaluating it here
+    // because any of the below retries may end up completing and calling
+    // 'user_cb_', which may very well destroy this Rpc object and render
+    // 'rpc_name_' and 'client_' inaccessible.
+    KLOG_EVERY_N_SECS(WARNING, 1) << retry_warning;
   });
   // Pull out the RPC status.
   const bool is_multi_master = client_->IsMultiMaster();

