Date: Fri, 9 Jun 2017 00:27:18 +0000 (UTC)
From: "Maddineni Sukumar (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-16138) Cannot open regions after non-graceful shutdown due to deadlock with Replication Table

    [ https://issues.apache.org/jira/browse/HBASE-16138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043717#comment-16043717 ]

Maddineni Sukumar commented on HBASE-16138:
-------------------------------------------

Thanks [~tedyu@apache.org]. Created a new Review Board request: https://reviews.apache.org/r/59939/

> Cannot open regions after non-graceful shutdown due to deadlock with Replication Table
> --------------------------------------------------------------------------------------
>
>                 Key: HBASE-16138
>                 URL: https://issues.apache.org/jira/browse/HBASE-16138
>             Project: HBase
>          Issue Type: Sub-task
>          Components: Replication
>            Reporter: Joseph
>            Assignee: Ashu Pachauri
>            Priority: Critical
>         Attachments: HBASE-16138.patch, HBASE-16138-v1.patch, HBASE-16138-v2.patch
>
>
> If we shut down an entire HBase cluster and attempt to start it back up, we have to run the WAL pre-log roll that occurs before opening a region. Yet this pre-log roll must record the new WAL inside ReplicationQueues. That call ends up blocking on TableBasedReplicationQueues.getOrBlockOnReplicationTable(), because the Replication Table is not up yet. And we cannot assign the Replication Table because we cannot open any regions. This deadlocks the entire cluster whenever we lose Replication Table availability.
> There are a few options, but none of them seem very good:
> 1. Depend on ZooKeeper-based replication until the Replication Table becomes available.
> 2. Have a separate WAL for system tables that does not perform any replication (see discussion at HBASE-14623), or just have a separate WAL for non-replicated vs. replicated regions.
> 3. Record the WAL in the ReplicationQueue asynchronously (don't block opening a region on this event), which could lead to inconsistent replication state (a rough illustrative sketch follows the quoted description below).
> The stack trace:
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.recordLog(ReplicationSourceManager.java:376)
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.preLogRoll(ReplicationSourceManager.java:348)
> org.apache.hadoop.hbase.replication.regionserver.Replication.preLogRoll(Replication.java:370)
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.tellListenersAboutPreLogRoll(FSHLog.java:637)
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.rollWriter(FSHLog.java:701)
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.rollWriter(FSHLog.java:600)
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.<init>(FSHLog.java:533)
> org.apache.hadoop.hbase.wal.DefaultWALProvider.getWAL(DefaultWALProvider.java:132)
> org.apache.hadoop.hbase.wal.RegionGroupingProvider.getWAL(RegionGroupingProvider.java:186)
> org.apache.hadoop.hbase.wal.RegionGroupingProvider.getWAL(RegionGroupingProvider.java:197)
> org.apache.hadoop.hbase.wal.WALFactory.getWAL(WALFactory.java:240)
> org.apache.hadoop.hbase.regionserver.HRegionServer.getWAL(HRegionServer.java:1883)
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:363)
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:129)
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745)
> Does anyone have any suggestions/ideas/feedback?
> Attached a review board at: https://reviews.apache.org/r/50546/
> It is still pretty rough; I would just like some feedback on it.
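For illustration only, here is a minimal sketch of option 3 above, assuming a hypothetical QueueStore interface standing in for the table-backed replication queue (none of the class or method names below are HBase's real API): the pre-log-roll path hands the recordLog call to a background thread and returns immediately, so opening a region no longer waits for the Replication Table, at the cost of the window of inconsistent replication state noted in the description.

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical illustration of option 3: hand the "record new WAL" call to a
 * background thread so that the pre-log roll (and therefore region opening)
 * never blocks on the Replication Table being available. The names here are
 * invented for the sketch and do not match HBase's ReplicationQueues API.
 */
public class AsyncWalRecorder {

  /** Stand-in for the call that currently blocks on getOrBlockOnReplicationTable(). */
  public interface QueueStore {
    void recordLog(String queueId, String walName) throws Exception;
  }

  private static final long RETRY_SLEEP_MS = 1000L;

  private final QueueStore store;
  private final ExecutorService executor = Executors.newSingleThreadExecutor();

  public AsyncWalRecorder(QueueStore store) {
    this.store = store;
  }

  /** Called from the pre-log roll; returns immediately instead of blocking the region open. */
  public void recordLogAsync(final String queueId, final String walName) {
    executor.submit(() -> {
      // Retry until whatever backs the store (e.g. the Replication Table) comes up.
      // Between submission and success the persisted replication state is stale,
      // which is exactly the inconsistency risk called out above.
      while (true) {
        try {
          store.recordLog(queueId, walName);
          return;
        } catch (Exception e) {
          try {
            Thread.sleep(RETRY_SLEEP_MS);
          } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            return;
          }
        }
      }
    });
  }

  public void shutdown() throws InterruptedException {
    executor.shutdown();
    executor.awaitTermination(1, TimeUnit.MINUTES);
  }
}
{code}

The hard part of this approach is not shown: bounding the retry loop and reconciling the queue state if the region server dies while a recordLog call is still in flight.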