Date: Thu, 10 Dec 2015 06:51:11 +0000 (UTC)
From: "Lars Hofhansl (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-14953) HBaseInterClusterReplicationEndpoint: Do not retry the whole batch of edits in case of RejectedExecutionException

    [ https://issues.apache.org/jira/browse/HBASE-14953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15050198#comment-15050198 ]

Lars Hofhansl commented on HBASE-14953:
---------------------------------------

Interesting, didn't think of that case. Amazing how many problems a little change like this can cause.
Why not add a real queue (i.e. not synchronous queue)?
(In that case we need to set coreThreads to maxThreads as well, and allow core threads to time out)
Since we're waiting on the futures to finish anyway, as they sit in the queue we'd naturally wait exactly the right amount of time, so the queue can be unbounded - eventually we'd have all workers waiting, which is what we want.

> HBaseInterClusterReplicationEndpoint: Do not retry the whole batch of edits in case of RejectedExecutionException
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-14953
>                 URL: https://issues.apache.org/jira/browse/HBASE-14953
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 2.0.0, 1.2.0, 1.3.0
>            Reporter: Ashu Pachauri
>            Assignee: Ashu Pachauri
>            Priority: Critical
>         Attachments: HBASE-14953-V1.patch
>
>
> When we have wal provider set to multiwal, the ReplicationSource has multiple worker threads submitting batches to HBaseInterClusterReplicationEndpoint. In such a scenario, it is quite common to encounter RejectedExecutionException because it takes quite long for shipping edits to peer cluster compared to reading edits from source and submitting more batches to the endpoint.
> The logs are just filled with warnings due to this very exception.
> Since we subdivide batches before actually shipping them, we don't need to fail and resend the whole batch if one of the sub-batches fails with RejectedExecutionException. Rather, we should just retry the failed sub-batches.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
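
The pool configuration suggested in the comment could look roughly like the sketch below. It assumes a plain java.util.concurrent.ThreadPoolExecutor; the class name, the helper names, the thread count and the keep-alive value are hypothetical and are not taken from the HBase patch. The reason core threads have to equal max threads is that a ThreadPoolExecutor only grows beyond its core size when the queue refuses a task, which an unbounded queue never does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class UnboundedQueuePoolSketch {

    // Hypothetical helper: a pool with an unbounded queue instead of a SynchronousQueue.
    static ThreadPoolExecutor newPool(int maxThreads, long keepAliveSeconds) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            maxThreads,                 // core == max, since the queue never fills up
            maxThreads,
            keepAliveSeconds, TimeUnit.SECONDS,
            new LinkedBlockingQueue<Runnable>());
        // Let idle core threads exit so the pool shrinks when nothing is being shipped.
        pool.allowCoreThreadTimeOut(true);
        return pool;
    }

    // Hypothetical stand-in for shipping sub-batches: submit them all, then wait on the futures.
    static int submitAndWait(ThreadPoolExecutor pool, int subBatches)
            throws InterruptedException, ExecutionException {
        List<Future<Integer>> futures = new ArrayList<>();
        for (int i = 0; i < subBatches; i++) {
            final int id = i;
            Callable<Integer> task = () -> {
                Thread.sleep(10);       // simulated work; real code would ship edits here
                return id;
            };
            // With an unbounded queue, submit() queues the task instead of rejecting it.
            futures.add(pool.submit(task));
        }
        int completed = 0;
        for (Future<Integer> f : futures) {
            f.get();                    // the caller blocks here, which throttles the submitter
            completed++;
        }
        return completed;
    }

    public static void main(String[] args) throws Exception {
        ThreadPoolExecutor pool = newPool(2, 60);
        try {
            System.out.println("completed=" + submitAndWait(pool, 20));
        } finally {
            pool.shutdown();
        }
    }
}
```

In this sketch more tasks are submitted than there are threads, yet none is rejected: the surplus waits in the queue, and the submitting thread waits on the futures, which is the back-pressure described in the comment.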