Date: Wed, 21 Dec 2016 19:40:58 +0000 (UTC)
From: "Hudson (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-17341) Add a timeout during replication endpoint termination

    [ https://issues.apache.org/jira/browse/HBASE-17341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15767960#comment-15767960 ]

Hudson commented on HBASE-17341:
--------------------------------

FAILURE: Integrated in Jenkins build HBase-Trunk_matrix #2173 (See [https://builds.apache.org/job/HBase-Trunk_matrix/2173/])
HBASE-17341 Add a timeout during replication endpoint termination (tedyu: rev cac0904c16dde9eb7bdbb57e4a33224dd4edb77f)
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationSource.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java


> Add a timeout during replication endpoint termination
> -----------------------------------------------------
>
>                 Key: HBASE-17341
>                 URL: https://issues.apache.org/jira/browse/HBASE-17341
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.0.0, 1.3.0, 1.4.0, 1.1.7, 0.98.23, 1.2.4
>            Reporter: Vincent Poon
>            Assignee: Vincent Poon
>            Priority: Critical
>             Fix For: 2.0.0, 1.4.0
>
>         Attachments: HBASE-17341.branch-1.1.v1.patch, HBASE-17341.branch-1.1.v2.patch, HBASE-17341.master.v1.patch, HBASE-17341.master.v2.patch
>
>
> In ReplicationSource#terminate(), a Future is obtained from ReplicationEndpoint#stop(). Future.get() is then called, but can potentially hang there if something went wrong in the endpoint stop().
> Hanging there has serious implications, because the thread could potentially be the ZK event thread (e.g. watcher calls ReplicationSourceManager#removePeer() -> ReplicationSource#terminate() -> blocked). This means no other events in the ZK event queue will get processed, which for HBase means other ZK watches such as replication watch notifications, snapshot watch notifications, even RegionServer shutdown will all get blocked.
> The short term fix addressed here is to simply add a timeout for Future.get(). But the severe consequences seen here perhaps suggest a broader refactoring of the ZKWatcher usage in HBase is in order, to protect against situations like this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
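
The short-term fix described in the issue above (bounding Future.get() with a timeout so the calling thread cannot block forever) can be sketched roughly as follows. This is a minimal illustration and not the actual HBASE-17341 patch: the class BoundedTerminate, the helper awaitStop, and the 100 ms timeout are hypothetical names and values chosen for the example. Only the java.util.concurrent calls are real.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedTerminate {

    /**
     * Hypothetical helper: waits at most timeoutMs for an endpoint's stop
     * Future. Returns true only if the stop completed normally in time.
     */
    static boolean awaitStop(Future<?> stopFuture, long timeoutMs) {
        try {
            // Bounded wait: unlike the no-argument get(), this cannot hang forever.
            stopFuture.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // The stop did not finish in time; give up so the caller can move on.
            return false;
        } catch (ExecutionException e) {
            // The stop finished, but with a failure.
            return false;
        } catch (InterruptedException e) {
            // Restore the interrupt flag for code further up the stack.
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Stands in for an endpoint whose stop() never completes.
        Future<?> hung = new CompletableFuture<Void>();
        System.out.println("hung endpoint stopped: " + awaitStop(hung, 100));

        // Stands in for an endpoint that stopped cleanly.
        Future<?> done = CompletableFuture.completedFuture(null);
        System.out.println("healthy endpoint stopped: " + awaitStop(done, 100));
    }
}
```

With the hung future the caller regains control after roughly the timeout instead of blocking indefinitely, which is the property that matters when the caller might be the ZK event thread. What to do after a timeout (log, retry, abort) is left open here.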