Date: Wed, 3 Aug 2016 17:54:20 +0000 (UTC)
From: "Joseph (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Updated] (HBASE-15937) Figure out retry limit and timing for replication queue table operations
Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm

     [ https://issues.apache.org/jira/browse/HBASE-15937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph updated HBASE-15937:
---------------------------
    Attachment: HBASE-15937.patch

Fixed a constant to match the comments and rerunning the Unit Tests

> Figure out retry limit and timing for replication queue table operations
> ------------------------------------------------------------------------
>
>                 Key: HBASE-15937
>                 URL: https://issues.apache.org/jira/browse/HBASE-15937
>             Project: HBase
>          Issue Type: Sub-task
>          Components: Replication
>            Reporter: Joseph
>            Assignee: Joseph
>         Attachments: HBASE-15937.patch
>
>
> ReplicationQueuesHBaseImpl will abort the server if any of its HBase Table writes/reads fails. We should figure out a reasonable retry limit and pause duration for these operations.
> As of now the timeouts look like:
> Table initialization:
> 240 retries
> 1 minute pause (because the Master may not be initialized yet, createTable retries are immediately rejected by PleaseHoldException, so we should sleep in between RPC requests)
> 1 minute RPC timeouts
> Total: At minimum 2 hours of retries
> Normal Replication Table operations:
> 240 retries
> 100 millis pause (because we assume the cluster is in a more stable state, we assume most exceptions will be RPC timeouts, so I am using the standard RPC pause)
> 1 minute RPC timeouts
> Total: Assuming operations fail because of RPC timeouts, a minimum of 2 hours of retries. With just pauses we only have 24 seconds.
> All of these timeouts are configurable too though.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
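
The retry totals quoted in the issue description can be sanity-checked with a flat model in which every attempt costs the same fixed amount of time. The sketch below is plain arithmetic, not HBase code: `retry_budget` is a hypothetical helper, and the model ignores any client-side backoff or operation-level timeout. Under that assumption, 240 pauses of 100 ms give the 24 seconds mentioned above, while 240 one-minute waits come to four hours, so the "minimum of 2 hours" figures presumably rest on an accounting that the issue text does not spell out.

```python
from datetime import timedelta


def retry_budget(retries: int, per_attempt: timedelta) -> timedelta:
    """Total elapsed time if each of `retries` attempts costs `per_attempt`.

    Hypothetical helper for illustration; assumes a flat cost per attempt
    (no backoff multiplier, no overall operation timeout).
    """
    return retries * per_attempt


RETRIES = 240  # retry count quoted in the issue description

# Normal table operations where every attempt fails fast and only the
# 100 ms pause is paid between attempts.
pause_only = retry_budget(RETRIES, timedelta(milliseconds=100))

# Every attempt waits out a full 1 minute RPC timeout, or a full
# 1 minute pause during table initialization.
one_minute_each = retry_budget(RETRIES, timedelta(minutes=1))

print(pause_only)       # 0:00:24
print(one_minute_each)  # 4:00:00
```

Because the issue notes that all of these values are configurable, the same helper can be re-run with whatever retry count and pause an operator actually sets.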