Date: Fri, 20 Dec 2019 06:24:00 +0000 (UTC)
From: "Michael Stack (Jira)"
To: dev@hbase.apache.org
Subject: [jira] [Created] (HBASE-23600) Improve chances of edits landing into hbase:meta even when high load

Michael Stack created HBASE-23600:
-------------------------------------

             Summary: Improve chances of edits landing into hbase:meta even when high load
                 Key: HBASE-23600
                 URL: https://issues.apache.org/jira/browse/HBASE-23600
             Project: HBase
          Issue Type: Improvement
          Components: rpc
            Reporter: Michael Stack

Of late I've been testing clusters under high load to study failures and to figure out how to effect recovery when a cluster is unable to recover on its own. One interesting case is an RS that is struggling, mostly because writes to HDFS are backed up and sync calls take a long time to complete. The RPC layer backs up with waiting requests and eventually goes over one or more bounds. The RS then starts throwing CallQueueTooBigExceptions (CQTBEs). This struggling state can last a good while.

We throw CQTBEs whatever the priority of the incoming request, and we throw them in two places. The first is on the original parse of the request, before we dispatch it to a handler: here we check the size of all queues and, if over the threshold (default 1G), throw the exception. The second is later, when we dispatch the request to the internal queues: we count the items in the queue and, if any one queue is over its limit (default is 10 * handler count), we fail the dispatch and again throw CQTBE.

We shouldn't be running w/ big queues. We should be rejecting requests we know we'll never process before the client loses interest (see the CoDel thesis and the implementations added a good while back). TODO.

Meantime, I was looking to see what would happen if, having read a high-priority request, I let it through even when above thresholds rather than dropping it on the floor. My main concern is edits to hbase:meta. Under sustained, saturated load on the RS carrying hbase:meta, edits may not land. The result is incomplete Procedures and a disorientated Master. I was playing w/ trying to put off the corruption as long as possible, experimenting (CoDel doesn't do priority at first blush; we probably want to add this).
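The two checks described above, and the experiment of letting high-priority requests through, can be sketched roughly as follows. This is a minimal illustration only, not the HBase RPC server code: the class `AdmissionSketch`, the `Decision` enum, the `admit` method, and the `HIGH_PRIORITY` cutoff of 200 are all hypothetical names and values chosen for the example. Only the two defaults (1G total queue size, 10 * handler count per queue) come from the description.

```java
/** Hypothetical sketch of the two call-queue bounds; not HBase's actual code. */
final class AdmissionSketch {
    /** Bound on the summed size of all queued calls: 1G, per the description. */
    static final long DEFAULT_MAX_QUEUE_BYTES = 1024L * 1024L * 1024L;
    /** Assumed cutoff for "high priority"; an illustrative value only. */
    static final int HIGH_PRIORITY = 200;

    enum Decision { ACCEPT, REJECT_TOTAL_SIZE, REJECT_QUEUE_LENGTH }

    private final long maxQueueBytes;
    private final int maxQueueLength;
    private final boolean admitHighPriorityOverLimit;

    AdmissionSketch(long maxQueueBytes, int handlerCount, boolean admitHighPriorityOverLimit) {
        this.maxQueueBytes = maxQueueBytes;
        // Per-queue item limit defaults to 10 * handler count, per the description.
        this.maxQueueLength = 10 * handlerCount;
        this.admitHighPriorityOverLimit = admitHighPriorityOverLimit;
    }

    /**
     * @param queuedBytes current total size of all queued calls
     * @param queueLength current item count of the queue this call would join
     * @param priority    priority of the incoming call
     */
    Decision admit(long queuedBytes, int queueLength, int priority) {
        // The experiment: a high-priority call skips both bounds.
        boolean bypass = admitHighPriorityOverLimit && priority >= HIGH_PRIORITY;
        if (bypass) {
            return Decision.ACCEPT;
        }
        // Check 1: at request parse time, against the size of all queues.
        if (queuedBytes > maxQueueBytes) {
            return Decision.REJECT_TOTAL_SIZE;
        }
        // Check 2: at dispatch time, against the target queue's item count.
        if (queueLength > maxQueueLength) {
            return Decision.REJECT_QUEUE_LENGTH;
        }
        return Decision.ACCEPT;
    }

    public static void main(String[] args) {
        AdmissionSketch lenient = new AdmissionSketch(DEFAULT_MAX_QUEUE_BYTES, 30, true);
        long overloaded = DEFAULT_MAX_QUEUE_BYTES + 1;
        System.out.println("normal priority: " + lenient.admit(overloaded, 0, 0));
        System.out.println("high priority:   " + lenient.admit(overloaded, 0, HIGH_PRIORITY));
    }
}
```

In a real server, either rejection would surface to the client as a CQTBE. Note that the bypass in this sketch is unbounded, so sustained high-priority traffic could still grow the queues; it only postpones trouble, which matches the "put off the corruption as long as possible" framing above.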