hbase-dev mailing list archives

From "Michael Stack (Jira)" <j...@apache.org>
Subject [jira] [Created] (HBASE-23600) Improve chances of edits landing into hbase:meta even when high load
Date Fri, 20 Dec 2019 06:24:00 GMT
Michael Stack created HBASE-23600:

             Summary: Improve chances of edits landing into hbase:meta even when high load
                 Key: HBASE-23600
                 URL: https://issues.apache.org/jira/browse/HBASE-23600
             Project: HBase
          Issue Type: Improvement
          Components: rpc
            Reporter: Michael Stack

Of late I've been testing clusters under high load to study failures and to figure out how
to effect recovery when a cluster is unable to recover on its own.

One interesting case is a RS that is struggling mostly because writes to HDFS are backed up
and sync calls are taking a long time to complete. The RPC backs up with waiting requests
and eventually goes over one or more bounds. The RS then starts throwing
CallQueueTooBigExceptions (CQTBEs). This struggling state can last a good while. We throw
CQTBEs whatever the priority of the incoming request.

We throw CQTBE in two places. The first is on the original parse of the request, before we
dispatch it to a handler: here we check the size of all queues and, if over the threshold
(default 1G), throw the exception. The second is later, when we dispatch the request to the
internal queues: we count the items in the queue and, if any one queue is over its limit
(default is 10 * handler count), we fail the dispatch and again throw CQTBE.
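
To make the two rejection points concrete, here is a minimal sketch. It is a toy model, not
the HBase RPC code: the class names (QueueTooBigException, TwoStageAdmission) are
hypothetical, there is a single queue rather than several, and counting the incoming
request's bytes toward the total is an assumption.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for CallQueueTooBigException. */
class QueueTooBigException extends RuntimeException {
    QueueTooBigException(String msg) { super(msg); }
}

/** Toy model of the two checks described above; not HBase code. */
class TwoStageAdmission {
    private final long maxTotalBytes;    // check 1 bound (the issue cites a 1G default)
    private final int maxCallsPerQueue;  // check 2 bound: 10 * handler count
    private final AtomicLong queuedBytes = new AtomicLong();
    private final Queue<byte[]> queue = new ArrayDeque<>();

    TwoStageAdmission(long maxTotalBytes, int handlerCount) {
        this.maxTotalBytes = maxTotalBytes;
        this.maxCallsPerQueue = 10 * handlerCount;
    }

    /** Check 1: on parse, before dispatch, against total queued bytes. */
    void checkTotalSize(int requestBytes) {
        if (queuedBytes.get() + requestBytes > maxTotalBytes) {
            throw new QueueTooBigException("total queued bytes over threshold");
        }
    }

    /** Check 2: at dispatch, against the number of calls in the queue. */
    synchronized void dispatch(byte[] request) {
        if (queue.size() >= maxCallsPerQueue) {
            throw new QueueTooBigException("queue length over threshold");
        }
        queue.add(request);
        queuedBytes.addAndGet(request.length);
    }

    /** Runs both checks in order; note that neither looks at priority. */
    void submit(byte[] request) {
        checkTotalSize(request.length);
        dispatch(request);
    }
}
```

With one handler the sketch accepts ten queued calls and rejects the eleventh, whatever that
call happens to be.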

We shouldn't be running w/ big queues. We should be rejecting requests we know we'll never
process before the client loses interest (see the CoDel thesis and the implementations
added a good while back). TODO.

Meantime, I was looking at high-priority requests: having read one, rather than dropping it
on the floor, what would happen if I let it through even when above thresholds? My main
concern is edits to hbase:meta. Under sustained, saturated load on the RS carrying
hbase:meta, edits may not land. The result is incomplete Procedures and a disorientated
Master. I was experimenting with putting off the corruption as long as possible (CoDel
doesn't do priority at first blush; we probably want to add this).
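
A rough sketch of the experiment, to show the shape of the idea rather than any actual
patch. Everything here is hypothetical: the class name, the numeric priority scale, and the
cutoff value are assumptions for illustration, not HBase constants or APIs.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Toy queue that enforces its limit only on calls below a priority cutoff. */
class PriorityAwareQueue {
    private final int limit;               // normal bound on queue length
    private final int highPriorityCutoff;  // hypothetical: at or above this, bypass the bound
    private final Queue<String> queue = new ArrayDeque<>();

    PriorityAwareQueue(int limit, int highPriorityCutoff) {
        this.limit = limit;
        this.highPriorityCutoff = highPriorityCutoff;
    }

    /** Returns false where the server would otherwise throw CQTBE. */
    synchronized boolean offer(String call, int priority) {
        boolean overLimit = queue.size() >= limit;
        if (overLimit && priority < highPriorityCutoff) {
            return false; // ordinary request over the threshold: reject
        }
        queue.add(call);  // under the limit, or high priority: let it through
        return true;
    }

    synchronized int size() {
        return queue.size();
    }
}
```

With a limit of 2 and a cutoff of 200, a third ordinary call is rejected while a call
carrying priority 200 (say, an hbase:meta edit) is still queued. The sketch puts no ceiling
on how far high-priority calls can overshoot the limit.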

