Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Date: Thu, 10 Dec 2015 06:51:11 +0000 (UTC)
From: "Lars Hofhansl (JIRA)" <jira@apache.org>
To: issues@hbase.apache.org
Message-ID: <JIRA.12920223.1449615359000.321919.1449730271091@Atlassian.JIRA>
In-Reply-To: <JIRA.12920223.1449615359000@Atlassian.JIRA>
References: <JIRA.12920223.1449615359000@Atlassian.JIRA>
 <JIRA.12920223.1449615359195@arcas>
Subject: [jira] [Commented] (HBASE-14953)
 HBaseInterClusterReplicationEndpoint: Do not retry the whole batch of edits
 in case of RejectedExecutionException
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HBASE-14953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15050198#comment-15050198 ] 

Lars Hofhansl commented on HBASE-14953:
---------------------------------------

Interesting, didn't think of that case. Amazing how many problems a little change like this can cause.

Why not use a real queue (i.e. not a SynchronousQueue)? In that case we would also need to set coreThreads equal to maxThreads and allow core threads to time out.

Since we wait on the futures to finish anyway, the time tasks spend sitting in the queue makes us wait exactly the right amount of time, so the queue can be unbounded. Eventually all workers would be waiting, which is what we want.
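
A minimal sketch of the pool configuration suggested above, using only java.util.concurrent. The class name, helper names, and the keep-alive value are hypothetical and not taken from the HBase code or the attached patch. With an unbounded LinkedBlockingQueue, a ThreadPoolExecutor never grows past its core size, which is why the core size is set equal to the maximum and core threads are allowed to time out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class QueuedPoolSketch {
  // Hypothetical helper: maxThreads and keepAliveSeconds are illustrative parameters.
  static ThreadPoolExecutor newQueuedPool(int maxThreads, long keepAliveSeconds) {
    ThreadPoolExecutor pool = new ThreadPoolExecutor(
        maxThreads, maxThreads,              // core == max, since an unbounded queue never triggers growth
        keepAliveSeconds, TimeUnit.SECONDS,
        new LinkedBlockingQueue<Runnable>()); // unbounded queue instead of a SynchronousQueue
    pool.allowCoreThreadTimeOut(true);        // idle threads still go away
    return pool;
  }

  // Submits more tasks than there are threads and waits on every future.
  // Excess tasks sit in the queue rather than being rejected.
  static int runAll(int maxThreads, int tasks) {
    ThreadPoolExecutor pool = newQueuedPool(maxThreads, 60);
    try {
      List<Future<Integer>> futures = new ArrayList<>();
      for (int i = 0; i < tasks; i++) {
        final int n = i;
        Callable<Integer> task = () -> n;     // stand-in for shipping one sub-batch
        futures.add(pool.submit(task));
      }
      int sum = 0;
      for (Future<Integer> f : futures) {
        sum += f.get();                       // the caller blocks here, as the comment describes
      }
      return sum;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } catch (ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      pool.shutdown();
    }
  }
}
```

The trade-off is that back-pressure comes only from callers blocking on their futures; nothing in the pool itself bounds the queue.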


> HBaseInterClusterReplicationEndpoint: Do not retry the whole batch of edits in case of RejectedExecutionException
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-14953
>                 URL: https://issues.apache.org/jira/browse/HBASE-14953
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 2.0.0, 1.2.0, 1.3.0
>            Reporter: Ashu Pachauri
>            Assignee: Ashu Pachauri
>            Priority: Critical
>         Attachments: HBASE-14953-V1.patch
>
>
> When the WAL provider is set to multiwal, the ReplicationSource has multiple worker threads submitting batches to HBaseInterClusterReplicationEndpoint. In such a scenario, it is quite common to encounter RejectedExecutionException, because shipping edits to the peer cluster takes much longer than reading edits from the source and submitting more batches to the endpoint.
> The logs fill up with warnings caused by this exception.
> Since we subdivide batches before actually shipping them, we don't need to fail and resend the whole batch if one of the sub-batches fails with RejectedExecutionException. Rather, we should just retry the failed sub-batches. 
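
The idea in the description above can be sketched roughly as follows. This is not the attached patch: the class name, method name, and the shipper callback are hypothetical, and only java.util.concurrent is used. Sub-batches that the executor rejects are collected and resubmitted on their own once the accepted ones have finished, instead of failing the whole batch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.function.Consumer;

class SubBatchRetrySketch {
  // Ships every sub-batch exactly once, resubmitting only those the pool rejected.
  // Returns the number of submission rounds that were needed.
  static <T> int shipAll(ExecutorService pool, List<List<T>> subBatches,
                         Consumer<List<T>> shipper) {
    List<List<T>> pending = new ArrayList<>(subBatches);
    int rounds = 0;
    while (!pending.isEmpty()) {
      if (pool.isShutdown()) {
        // A shut-down pool rejects everything, so retrying would never finish.
        throw new IllegalStateException("executor is shut down");
      }
      rounds++;
      List<List<T>> rejected = new ArrayList<>();
      List<Future<?>> accepted = new ArrayList<>();
      for (List<T> sub : pending) {
        try {
          accepted.add(pool.submit(() -> shipper.accept(sub)));
        } catch (RejectedExecutionException e) {
          rejected.add(sub);                  // remember only this sub-batch
        }
      }
      for (Future<?> f : accepted) {
        try {
          f.get();                            // wait for the accepted sub-batches
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          throw new IllegalStateException(e);
        } catch (ExecutionException e) {
          // Real shipping failures are outside the scope of this sketch.
          throw new IllegalStateException(e);
        }
      }
      if (accepted.isEmpty()) {
        Thread.yield();                       // nothing was accepted; let the workers catch up
      }
      pending = rejected;                     // next round retries only the rejected ones
    }
    return rounds;
  }
}
```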


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)