hbase-dev mailing list archives

From "Vincent Poon (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HBASE-17341) Add a timeout during replication endpoint termination
Date Mon, 19 Dec 2016 21:26:58 GMT
Vincent Poon created HBASE-17341:

             Summary: Add a timeout during replication endpoint termination
                 Key: HBASE-17341
                 URL: https://issues.apache.org/jira/browse/HBASE-17341
             Project: HBase
          Issue Type: Bug
    Affects Versions: 1.2.4, 0.98.23, 1.1.7, 2.0.0, 1.3.0, 1.4.0
            Reporter: Vincent Poon
            Priority: Critical

In ReplicationSource#terminate(), a Future is obtained from ReplicationEndpoint#stop().  Future.get()
is then called without a timeout, so it can block indefinitely if something goes wrong in the
endpoint's stop() and the Future never completes.

Hanging there has serious implications, because the blocked thread may be the ZK event
thread (e.g. a watcher calls ReplicationSourceManager#removePeer() -> ReplicationSource#terminate()
-> blocked).  While it is blocked, no other events in the ZK event queue are processed, which
for HBase means that other ZK watches, such as replication watch notifications and snapshot watch
notifications, and even RegionServer shutdown are all blocked as well.
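
The effect can be reproduced in miniature without ZooKeeper. The sketch below is only an
illustration: a single-threaded ExecutorService stands in for the ZK event thread, and the
class and method names (EventThreadStall, secondEventRanWithin) are hypothetical, not HBase
or ZooKeeper APIs. It shows that a second event queued behind a blocking handler does not run
until the first handler returns.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class EventThreadStall {
    /**
     * Queues a handler that blocks for handlerBlockMs, then a second event,
     * on one shared thread. Returns true if the second event ran within waitMs.
     */
    public static boolean secondEventRanWithin(long handlerBlockMs, long waitMs)
            throws InterruptedException {
        // Assumption: one thread processes all events in order, like an event queue.
        ExecutorService eventThread = Executors.newSingleThreadExecutor();
        CountDownLatch secondEventDone = new CountDownLatch(1);
        try {
            // First event: a handler that blocks (stand-in for an untimed Future.get()).
            eventThread.execute(() -> {
                try {
                    Thread.sleep(handlerBlockMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            // Second event: unrelated work that is stuck behind the first handler.
            eventThread.execute(secondEventDone::countDown);
            return secondEventDone.await(waitMs, TimeUnit.MILLISECONDS);
        } finally {
            eventThread.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("blocked handler, second event ran: "
                + secondEventRanWithin(2000, 200));
        System.out.println("fast handler, second event ran: "
                + secondEventRanWithin(0, 2000));
    }
}
```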

The short-term fix addressed here is simply to add a timeout to Future.get().  But the severity
of the consequences seen here suggests that a broader refactoring of the ZKWatcher usage in HBase
may be in order, to protect against situations like this.
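
A minimal sketch of the short-term fix, using only java.util.concurrent: wait on the Future
with Future.get(timeout, unit) instead of the untimed get(). The class name, the awaitStop
helper and the timeout values are assumptions made for illustration; this is not the actual
HBase patch.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedTerminate {
    /**
     * Waits at most timeoutMs for the endpoint's stop Future.
     * Returns true only if the Future completed normally within the timeout.
     */
    public static boolean awaitStop(Future<?> stopFuture, long timeoutMs) {
        try {
            stopFuture.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // stop() did not finish in time: give up waiting so the caller's thread is freed.
            return false;
        } catch (ExecutionException e) {
            // stop() finished, but with a failure.
            return false;
        } catch (InterruptedException e) {
            // Preserve the interrupt status for the caller.
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // A Future that never completes, standing in for a hung endpoint stop().
        Future<Void> hung = new CompletableFuture<>();
        System.out.println("hung endpoint stopped in time: " + awaitStop(hung, 100));

        // A Future that is already complete, standing in for a clean stop().
        Future<Void> done = CompletableFuture.completedFuture(null);
        System.out.println("clean endpoint stopped in time: " + awaitStop(done, 100));
    }
}
```

On timeout the caller still has to decide what to do with the endpoint that failed to stop
(for example, log a warning and continue); the sketch only guarantees that the calling thread
is released.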

This message was sent by Atlassian JIRA