hbase-dev mailing list archives

From "Josh Elser (Jira)" <j...@apache.org>
Subject [jira] [Created] (HBASE-24779) Improve insight into replication WAL readers hung on checkQuota
Date Mon, 27 Jul 2020 16:01:00 GMT
Josh Elser created HBASE-24779:
----------------------------------

             Summary: Improve insight into replication WAL readers hung on checkQuota
                 Key: HBASE-24779
                 URL: https://issues.apache.org/jira/browse/HBASE-24779
             Project: HBase
          Issue Type: Task
            Reporter: Josh Elser
            Assignee: Josh Elser


Helped a customer this past weekend who, across a large number of RegionServers, had some RegionServers
replicating data to a peer without issue while other RegionServers did not.

The number of queued WALs varied over the past 24hrs in the same manner: sometimes it spiked into the
100's of queued logs, while at other times only 1's-10's of logs were queued.

We were able to validate that there were "good" and "bad" RegionServers by creating a test
table, assigning it to a specific RegionServer, enabling replication on that table, and validating
whether local puts to it were replicated to the peer. On a good RS, data was replicated immediately.
On a bad RS, data was never replicated (at least not within the 10's of minutes that we
waited).
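
For reference, the probe amounts to something like the following (a minimal sketch with the Java client; in practice we drove it from the shell, the table/family/server names below are made up, and the exact {{Admin}} calls vary a bit by client version):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RegionInfo;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch of the per-RegionServer replication probe. Names are illustrative only.
public class ReplicationProbe {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    TableName probe = TableName.valueOf("replication_probe");
    try (Connection conn = ConnectionFactory.createConnection(conf);
        Admin admin = conn.getAdmin()) {
      // One family with REPLICATION_SCOPE_GLOBAL so its edits are eligible for replication.
      admin.createTable(TableDescriptorBuilder.newBuilder(probe)
          .setColumnFamily(ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("f"))
              .setScope(HConstants.REPLICATION_SCOPE_GLOBAL).build())
          .build());
      // Pin the probe's single region onto the RegionServer under test.
      RegionInfo region = admin.getRegions(probe).get(0);
      admin.move(region.getEncodedNameAsBytes(),
          ServerName.valueOf("rs-under-test.example.com,16020,1595865600000"));
      // Write a row locally, then check on the peer whether it ever arrives.
      try (Table table = conn.getTable(probe)) {
        table.put(new Put(Bytes.toBytes("r1"))
            .addColumn(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v1")));
      }
    }
  }
}
{code}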

On the "bad RS", we were able to observe that the \{{wal-reader}} thread(s) on that RS were
spending time in a Thread.sleep() in a different location than the other. Specifically it
was sitting in the {{ReplicationSourceWALReader#checkQuota()}}'s sleep call, _not_ the {{handleEmptyWALBatch()}}
method on the same class.
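
For anyone following along, the blocked path is roughly this shape (a simplified, self-contained sketch of the pattern, not the actual source):

{code:java}
import java.util.concurrent.atomic.AtomicLong;

// Simplified sketch (not the exact HBase code) of the state the wal-reader thread is
// stuck in: keep sleeping while the shared buffer accounting says we are over quota.
public class QuotaCheckSketch {
  // Shared across all replication sources on the RegionServer: incremented as batches
  // of edits are read, and (in theory) decremented once the shipper ships them.
  private final AtomicLong totalBufferUsed = new AtomicLong();
  private final long totalBufferQuota = 256L * 1024 * 1024; // e.g. replication.total.buffer.quota
  private final long sleepForRetries = 1000;

  // Returns false while we are over the global quota; the reader loop just tries again,
  // so if an increment is "leaked" the reader sleeps here forever.
  boolean checkQuota() throws InterruptedException {
    if (totalBufferUsed.get() > totalBufferQuota) {
      Thread.sleep(sleepForRetries);
      return false;
    }
    return true;
  }
}
{code}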

My working assumption is that, somehow, these RegionServers got into a situation where they "allocated"
memory from the global replication buffer quota but never freed it. Then, because the WAL reader thinks
it has no free memory, it blocks indefinitely, and there are no pending edits to ship which would (thus)
free that memory. A cursory glance at the code gives me a _lot_ of anxiety around places where we don't
properly clean up that accounting (e.g. batches that fail to ship, dropping a peer). As a first stab, let
me add some more debugging so we can actually track this state properly, for the operators
and their sanity.
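
To make the "more debugging" concrete, I'm picturing visibility along these lines (purely a hypothetical sketch, not a patch; the class and method names are made up):

{code:java}
import java.util.concurrent.atomic.AtomicLong;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical sketch of the kind of insight to add: when a reader is parked in the
// quota check, periodically report how much of the global buffer is accounted as used
// and which source is blocked, so an operator can actually see the stuck state.
public class QuotaDebugSketch {
  private static final Logger LOG = LoggerFactory.getLogger(QuotaDebugSketch.class);

  static void logBlockedOnQuota(String peerId, String walName,
      AtomicLong totalBufferUsed, long totalBufferQuota) {
    LOG.warn("Replication WAL reader for peer {} is blocked on the global buffer quota "
        + "while reading {}: used={} quota={}",
        peerId, walName, totalBufferUsed.get(), totalBufferQuota);
  }
}
{code}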



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
