kudu-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From ale...@apache.org
Subject [1/2] kudu git commit: Improve consensus queue overflow logging
Date Mon, 29 Oct 2018 18:34:12 GMT
Repository: kudu
Updated Branches:
  refs/heads/master eb6f4e8a8 -> 2c5c9d06b


Improve consensus queue overflow logging

Suppose tablet server X is a leader of T tablets for which tablet server Y is a
follower. The relevant situation is when T is on the order of 100-1000. If Y
strains under its consensus load and falls behind processing consensus service
requests, UpdateConsensus requests from the leader will get rejected and cause
a message to be logged on the leader X for each of the T tablets. The message
looks like:

W1022 17:20:59.767554 13057 consensus_peers.cc:422] T 9255fdf03ad4451e9fcd62f26741bfe6 P 892cc0d4442c4cdaaf633ed2732f9246
-> Peer dc0af5867d52468f8fd47abf13c08040 (tablet_server_Y.kudu.com:7050): Couldn't send
request to peer dc0af5867d52468f8fd47abf13c08040 for tablet 9255fdf03ad4451e9fcd62f26741bfe6.
Status: Remote error: Service unavailable: UpdateConsensus request on kudu.consensus.ConsensusService
from 10.1.1.1:55528 dropped due to backpressure. The service queue is full; it has 50 items..
Retrying in the next heartbeat period. Already tried 1 times.

Y's consensus service pool also logs the same thing, but it doesn't have the
information about the tablet id or peer ids available to it, and it is throttled
to occur no more than once per second:

W1022 17:37:33.535168  4330 service_pool.cc:130] UpdateConsensus request on kudu.consensus.ConsensusService
from 10.45.26.115:36820 dropped due to backpressure. The service queue is full; it has 50
items.

This patch attempts to reduce the spam of the first message in the logs
by throttling it to occur once every 5 retries. It still is logged for
every tablet peer, but those messages are useful if one wants to trace
the history of a particular tablet.

I also added the throttling messages to Y's output, so it's now

W1022 17:37:33.535168  4330 service_pool.cc:130] UpdateConsensus request on kudu.consensus.ConsensusService
from 10.45.26.115:36820 dropped due to backpressure. The service queue is full; it has 50
items. [suppressed 5 similar messages]

when e.g. 5 other messages have been suppressed.

Change-Id: I7697c63babefac0f76bcc8c87d70f7e7125e55cc
Reviewed-on: http://gerrit.cloudera.org:8080/11801
Tested-by: Will Berkeley <wdberkeley@gmail.com>
Reviewed-by: Alexey Serbin <aserbin@cloudera.com>


Project: http://git-wip-us.apache.org/repos/asf/kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/kudu/commit/be64560f
Tree: http://git-wip-us.apache.org/repos/asf/kudu/tree/be64560f
Diff: http://git-wip-us.apache.org/repos/asf/kudu/diff/be64560f

Branch: refs/heads/master
Commit: be64560fe44c0466684cb9627e8bf515a7b96726
Parents: eb6f4e8
Author: Will Berkeley <wdberkeley@gmail.org>
Authored: Fri Oct 26 11:17:20 2018 -0700
Committer: Will Berkeley <wdberkeley@gmail.com>
Committed: Fri Oct 26 20:45:33 2018 +0000

----------------------------------------------------------------------
 src/kudu/consensus/consensus_peers.cc | 25 +++++++++++++++++++------
 src/kudu/consensus/consensus_peers.h  |  3 +--
 src/kudu/rpc/service_pool.cc          |  2 +-
 src/kudu/util/logging.h               |  2 +-
 4 files changed, 22 insertions(+), 10 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kudu/blob/be64560f/src/kudu/consensus/consensus_peers.cc
----------------------------------------------------------------------
diff --git a/src/kudu/consensus/consensus_peers.cc b/src/kudu/consensus/consensus_peers.cc
index a99b2c0..8d6b5e7 100644
--- a/src/kudu/consensus/consensus_peers.cc
+++ b/src/kudu/consensus/consensus_peers.cc
@@ -100,6 +100,9 @@ using strings::Substitute;
 namespace kudu {
 namespace consensus {
 
+// The number of retries between failed requests whose failure is logged.
+constexpr auto kNumRetriesBetweenLoggingFailedRequest = 5;
+
 Status Peer::NewRemotePeer(RaftPeerPB peer_pb,
                            string tablet_id,
                            string leader_uuid,
@@ -444,12 +447,22 @@ void Peer::ProcessResponseError(const Status& status) {
                                TabletServerErrorPB::Code_Name(response_.error().code()),
                                response_.error().code());
   }
-  LOG_WITH_PREFIX_UNLOCKED(WARNING) << "Couldn't send request to peer " << peer_pb_.permanent_uuid()
-      << " for tablet " << tablet_id_ << "."
-      << resp_err_info
-      << " Status: " << status.ToString() << "."
-      << " Retrying in the next heartbeat period."
-      << " Already tried " << failed_attempts_ << " times.";
+  // We log the warning at the first failure, then every
+  // 'kNumRetriesBetweenLoggingFailedRequest' retries.
+  // TODO(wdberkeley): If a use case comes up elsewhere, consider adding a
+  // KLOG_EVERY_N macro that supports an appropriate LogThrottler. For now,
+  // this class has 'failed_attempts_' available so it's overkill to add
+  // the throttler support.
+  if (failed_attempts_ % kNumRetriesBetweenLoggingFailedRequest == 1) {
+    LOG_WITH_PREFIX_UNLOCKED(WARNING) <<
+      Substitute("Couldn't send request to peer $0.$1 Status: $2. This is "
+                 "attempt $3: this message will repeat every $4th retry.",
+                 peer_pb_.permanent_uuid(),
+                 resp_err_info,
+                 status.ToString(),
+                 failed_attempts_,
+                 kNumRetriesBetweenLoggingFailedRequest);
+  }
   request_pending_ = false;
 }
 

http://git-wip-us.apache.org/repos/asf/kudu/blob/be64560f/src/kudu/consensus/consensus_peers.h
----------------------------------------------------------------------
diff --git a/src/kudu/consensus/consensus_peers.h b/src/kudu/consensus/consensus_peers.h
index b111fe9..3bcdcb5 100644
--- a/src/kudu/consensus/consensus_peers.h
+++ b/src/kudu/consensus/consensus_peers.h
@@ -189,12 +189,11 @@ class Peer : public std::enable_shared_from_this<Peer> {
   // Repeating timer responsible for scheduling heartbeats to this peer.
   std::shared_ptr<rpc::PeriodicTimer> heartbeater_;
 
-  // lock that protects Peer state changes, initialization, etc.
+  // Lock that protects Peer state changes, initialization, etc.
   mutable simple_spinlock peer_lock_;
   bool request_pending_ = false;
   bool closed_ = false;
   bool has_sent_first_request_ = false;
-
 };
 
 // A proxy to another peer. Usually a thin wrapper around an rpc proxy but can

http://git-wip-us.apache.org/repos/asf/kudu/blob/be64560f/src/kudu/rpc/service_pool.cc
----------------------------------------------------------------------
diff --git a/src/kudu/rpc/service_pool.cc b/src/kudu/rpc/service_pool.cc
index 62d46d6..50a0af7 100644
--- a/src/kudu/rpc/service_pool.cc
+++ b/src/kudu/rpc/service_pool.cc
@@ -126,7 +126,7 @@ void ServicePool::RejectTooBusy(InboundCall* c) {
                  c->remote_address().ToString(),
                  service_queue_.max_size());
   rpcs_queue_overflow_->Increment();
-  KLOG_EVERY_N_SECS(WARNING, 1) << err_msg;
+  KLOG_EVERY_N_SECS(WARNING, 1) << err_msg << THROTTLE_MSG;
   c->RespondFailure(ErrorStatusPB::ERROR_SERVER_TOO_BUSY,
                     Status::ServiceUnavailable(err_msg));
   DLOG(INFO) << err_msg << " Contents of service queue:\n"

http://git-wip-us.apache.org/repos/asf/kudu/blob/be64560f/src/kudu/util/logging.h
----------------------------------------------------------------------
diff --git a/src/kudu/util/logging.h b/src/kudu/util/logging.h
index 428dadc..e041831 100644
--- a/src/kudu/util/logging.h
+++ b/src/kudu/util/logging.h
@@ -121,7 +121,7 @@ class ScopedDisableRedaction {
 // the given severity.
 //
 // The log message may include the special token 'THROTTLE_MSG' which expands
-// to either an empty string or '[suppressed <n> similar messages]'.
+// to either an empty string or ' [suppressed <n> similar messages]'.
 //
 // Example usage:
 //   KLOG_EVERY_N_SECS(WARNING, 1) << "server is low on memory" << THROTTLE_MSG;


Mime
View raw message