Date: Tue, 21 Feb 2017 06:17:44 +0000 (UTC)
From: "Nick Dimiduk (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Updated] (HBASE-17341) Add a timeout during replication endpoint termination

     [ https://issues.apache.org/jira/browse/HBASE-17341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Dimiduk updated HBASE-17341:
---------------------------------
    Fix Version/s:     (was: 1.1.8)
                   1.1.9

> Add a timeout during replication endpoint termination
> -----------------------------------------------------
>
>                 Key: HBASE-17341
>                 URL: https://issues.apache.org/jira/browse/HBASE-17341
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.0.0, 1.3.0, 1.4.0, 1.1.7, 0.98.23, 1.2.4
>            Reporter: Vincent Poon
>            Assignee: Vincent Poon
>            Priority: Critical
>             Fix For: 2.0.0, 1.3.0, 1.4.0, 1.2.5, 0.98.24, 1.1.9
>
>         Attachments: HBASE-17341.branch-1.1.v1.patch, HBASE-17341.branch-1.1.v2.patch, HBASE-17341.master.v1.patch, HBASE-17341.master.v2.patch
>
>
> In ReplicationSource#terminate(), a Future is obtained from ReplicationEndpoint#stop(). Future.get() is then called, but can potentially hang there if something went wrong in the endpoint stop().
> Hanging there has serious implications, because the thread could potentially be the ZK event thread (e.g. watcher calls ReplicationSourceManager#removePeer() -> ReplicationSource#terminate() -> blocked). This means no other events in the ZK event queue will get processed, which for HBase means other ZK watches such as replication watch notifications, snapshot watch notifications, even RegionServer shutdown will all get blocked.
> The short term fix addressed here is to simply add a timeout for Future.get(). But the severe consequences seen here perhaps suggest a broader refactoring of the ZKWatcher usage in HBase is in order, to protect against situations like this.
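
The short-term fix described in the issue, bounding Future.get() with a timeout, can be illustrated with a minimal Java sketch. This is not the attached patch: the class name BoundedTermination, the method awaitStop, and the 200 ms timeout are hypothetical, and a never-completing CompletableFuture stands in for a hung ReplicationEndpoint#stop().

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; not the actual HBase ReplicationSource code.
public class BoundedTermination {

  /**
   * Waits for an endpoint's stop future, but never longer than timeoutMs.
   * Returns true if the stop completed in time, false otherwise.
   */
  public static boolean awaitStop(Future<?> stopFuture, long timeoutMs) {
    try {
      stopFuture.get(timeoutMs, TimeUnit.MILLISECONDS);
      return true;
    } catch (TimeoutException e) {
      // Give up waiting so the calling thread (possibly the ZK event
      // thread) is released instead of blocking indefinitely.
      stopFuture.cancel(true);
      return false;
    } catch (ExecutionException e) {
      // stop() itself failed; the caller is still unblocked.
      return false;
    } catch (InterruptedException e) {
      // Preserve the interrupt status for code further up the stack.
      Thread.currentThread().interrupt();
      return false;
    }
  }

  public static void main(String[] args) {
    // A future that never completes stands in for a hung endpoint stop().
    Future<Void> hung = new CompletableFuture<>();
    System.out.println("hung endpoint stopped: " + awaitStop(hung, 200));

    // An already-completed future stands in for a healthy endpoint.
    Future<Void> done = CompletableFuture.completedFuture(null);
    System.out.println("healthy endpoint stopped: " + awaitStop(done, 200));
  }
}
```

With the unbounded Future.get() the first call would never return. With the bounded variant the caller regains control after the timeout and can log the failure and continue processing other events.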