Date: Tue, 22 Sep 2015 16:09:04 +0000 (UTC)
From: "Ted Yu (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-14458) AsyncRpcClient#createRpcChannel() should check and remove dead channel before creating new one to same server

    [ https://issues.apache.org/jira/browse/HBASE-14458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902875#comment-14902875 ]

Ted Yu commented on HBASE-14458:
--------------------------------

+1

> AsyncRpcClient#createRpcChannel() should check and remove dead channel before creating new one to same server
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-14458
>                 URL: https://issues.apache.org/jira/browse/HBASE-14458
>             Project: HBase
>          Issue Type: Bug
>          Components: IPC/RPC
>    Affects Versions: 2.0.0, 1.2.0, 1.3.0, 1.1.3
>            Reporter: Samir Ahmic
>            Assignee: Samir Ahmic
>            Priority: Critical
>             Fix For: 2.0.0
>
>         Attachments: HBASE-14458.patch
>
>
> I noticed this issue while testing the master branch in distributed mode. Reproduction steps:
> 1. Write some data with hbase ltt
> 2. While ltt is writing, execute $graceful_stop.sh --restart --reload [rs]
> 3. Wait until the script starts to reload regions onto the restarted server. At that moment ltt will stop writing and eventually fail.
> After some digging I noticed that while ltt is working correctly there is a single connection per regionserver (lsof output for a single connection; 27109 is the ltt PID):
> {code}
> java 27109 hbase 143u 210579579 0t0 TCP hnode1:40423->hnode5:16020 (ESTABLISHED)
> {code}
> When the hnode5 server in this example is restarted and the script starts to reload regions onto it, ltt starts creating thousands of new TCP connections to this server:
> {code}
> java 27109 hbase *623u 210674415 0t0 TCP hnode1:52948->hnode5:16020 (ESTABLISHED)
> java 27109 hbase *624u 210674416 0t0 TCP hnode1:52949->hnode5:16020 (ESTABLISHED)
> java 27109 hbase *625u 210674417 0t0 TCP hnode1:52950->hnode5:16020 (ESTABLISHED)
> java 27109 hbase *627u 210674419 0t0 TCP hnode1:52952->hnode5:16020 (ESTABLISHED)
> java 27109 hbase *628u 210674420 0t0 TCP hnode1:52953->hnode5:16020 (ESTABLISHED)
> java 27109 hbase *633u 210674425 0t0 TCP hnode1:52958->hnode5:16020 (ESTABLISHED)
> ...
> {code}
> So here is what happened, based on some additional logging and debugging:
> - AsyncRpcClient never detected that the regionserver was restarted, because its regions had been moved away, there were no read/write requests to this server, and there is no heartbeat mechanism implemented
> - because of the above, the dead {code}AsyncRpcChannel{code} stayed in {code}PoolMap connections{code}
> - when ltt detected that the regions were moved back to hnode5, it tried to reconnect to hnode5, leading to this issue
> I was able to resolve this issue by adding the following to AsyncRpcClient#createRpcChannel():
> {code}
>     synchronized (connections) {
>       if (closed) {
>         throw new StoppedRpcClientException();
>       }
>       rpcChannel = connections.get(hashCode);
> +     if (rpcChannel != null && !rpcChannel.isAlive()) {
> +       LOG.debug("Removing dead channel from " + rpcChannel.address.toString());
> +       connections.remove(hashCode);
> +     }
>       if (rpcChannel == null || !rpcChannel.isAlive()) {
>         rpcChannel = new AsyncRpcChannel(this.bootstrap, this, ticket, serviceName, location);
>         connections.put(hashCode, rpcChannel);
> {code}
> I will attach a patch after some more testing.
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
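
For readers who want to try the idea outside HBase, the check-and-remove step from the quoted patch can be sketched in isolation. This is only a minimal illustration under stated assumptions, not the HBase code: `Channel`, `FakeChannel`, `ChannelCache` and `getOrCreate` are hypothetical names, and a plain `HashMap` stands in for `PoolMap`, so the sketch shows the eviction logic but not the connection pile-up itself.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for a pooled RPC channel; this is not HBase's AsyncRpcChannel.
interface Channel {
    boolean isAlive();
}

// Test double whose liveness can be flipped to simulate a server restart.
class FakeChannel implements Channel {
    volatile boolean alive = true;

    @Override
    public boolean isAlive() {
        return alive;
    }
}

// Simplified cache; a plain HashMap is assumed here in place of PoolMap.
class ChannelCache {
    private final Map<Integer, Channel> connections = new HashMap<>();

    Channel getOrCreate(int key, Supplier<Channel> factory) {
        synchronized (connections) {
            Channel channel = connections.get(key);
            // Evict a dead entry first, so it cannot linger next to its replacement.
            if (channel != null && !channel.isAlive()) {
                connections.remove(key);
                channel = null;
            }
            // Create and register a channel only when no live one is cached.
            if (channel == null) {
                channel = factory.get();
                connections.put(key, channel);
            }
            return channel;
        }
    }

    int size() {
        synchronized (connections) {
            return connections.size();
        }
    }
}

class DeadChannelEvictionDemo {
    public static void main(String[] args) {
        ChannelCache cache = new ChannelCache();
        FakeChannel first = new FakeChannel();
        cache.getOrCreate(16020, () -> first);

        first.alive = false; // simulate the regionserver restart
        Channel replacement = cache.getOrCreate(16020, FakeChannel::new);

        System.out.println("replaced dead channel: " + (replacement != first));
        System.out.println("cached entries: " + cache.size());
    }
}
```

The liveness check and the replacement happen under the same lock, mirroring the `synchronized (connections)` block in the patch, so two callers cannot both decide to replace the same dead entry.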