From: "Hadoop QA (JIRA)"
To: issues@hbase.apache.org
Date: Wed, 21 Jun 2017 04:59:00 +0000 (UTC)
Subject: [jira] [Commented] (HBASE-18167) OfflineMetaRepair tool may cause HMaster abort always

    [ https://issues.apache.org/jira/browse/HBASE-18167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16056993#comment-16056993 ]

Hadoop QA commented on HBASE-18167:
-----------------------------------

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 9s {color} | {color:red} HBASE-18167 does not apply to branch-1.3. Rebase required? Wrong Branch? See https://yetus.apache.org/documentation/0.3.0/precommit-patchnames for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12873798/HBASE-18167.branch-1.3.V2.patch |
| JIRA Issue | HBASE-18167 |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/7270/console |
| Powered by | Apache Yetus 0.3.0 http://yetus.apache.org |

This message was automatically generated.

> OfflineMetaRepair tool may cause HMaster abort always
> -----------------------------------------------------
>
>                 Key: HBASE-18167
>                 URL: https://issues.apache.org/jira/browse/HBASE-18167
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.4.0, 1.3.1, 1.3.2
>            Reporter: Pankaj Kumar
>            Assignee: Pankaj Kumar
>            Priority: Critical
>             Fix For: 1.4.0, 1.3.2
>
>         Attachments: HBASE-18167.branch-1.3.V2.patch, HBASE-18167-branch-1.patch, HBASE-18167-branch-1-V2.patch
>
>
> In the production environment, we hit an unusual scenario in which some meta table HFile blocks were missing for some reason.
> To recover the environment we tried to rebuild meta using the OfflineMetaRepair tool and restart the cluster, but HMaster couldn't finish its initialization. It always timed out because the namespace table region was never assigned.
> Steps to reproduce
> ==================
> 1. Assign the meta table region to HMaster (it can be on any RS; this is just to reproduce the scenario)
> {noformat}
> <property>
> <name>hbase.balancer.tablesOnMaster</name>
> <value>hbase:meta</value>
> </property>
> {noformat}
> 2. Start HMaster and RegionServer
> 3. Create two namespaces, say "ns1" & "ns2"
> 4. Create two tables "ns1:t1" & "ns2:t1"
> 5. flush 'hbase:meta'
> 6. Stop HMaster (graceful shutdown)
> 7. Kill -9 RegionServer (abnormal shutdown)
> 8. Run OfflineMetaRepair as follows:
> {noformat}
> hbase org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair -fix
> {noformat}
> 9. Restart HMaster and RegionServer
> 10. HMaster will never be able to finish its initialization and always aborts with the message below:
> {code}
> 2017-06-06 15:11:07,582 FATAL [Hostname:16000.activeMasterManager] master.HMaster: Unhandled exception. Starting shutdown.
> java.io.IOException: Timedout 120000ms waiting for namespace table to be assigned
>     at org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:98)
>     at org.apache.hadoop.hbase.master.HMaster.initNamespace(HMaster.java:1054)
>     at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:848)
>     at org.apache.hadoop.hbase.master.HMaster.access$600(HMaster.java:199)
>     at org.apache.hadoop.hbase.master.HMaster$2.run(HMaster.java:1871)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
> Root cause
> ==========
> 1. During HMaster startup, the AM assumes a failover scenario because old WAL files exist, so SSH/SCP will split the WAL files and assign the regions the dead server was holding.
> 2. SSH/SCP retrieves the regions held by that server from meta/AM's in-memory state, but meta only has the "regioninfo" entry (it was already rebuilt by OfflineMetaRepair). So an empty region list is returned and no assignment is triggered.
> 3. HMaster, which is waiting for the namespace table to be assigned, times out and always aborts.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
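
The root cause described in the quoted issue can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions, not HBase code: the class `RebuiltMetaSketch`, the `MetaRow` type, the methods `rebuildOffline` and `regionsOnServer`, and the server name are all hypothetical. It assumes, as the report states, that a rebuilt meta row keeps only its "regioninfo" entry and no hosting-server information.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model (not HBase code) of why crash handling assigns nothing after a meta rebuild. */
public class RebuiltMetaSketch {

    /** One simplified meta row: the region is always known, the hosting server may be absent. */
    static final class MetaRow {
        final String regionName;
        final String server; // null models a row that only has the "regioninfo" entry

        MetaRow(String regionName, String server) {
            this.regionName = regionName;
            this.server = server;
        }
    }

    /** Models the offline rebuild: every row is recreated without server information. */
    static List<MetaRow> rebuildOffline(List<MetaRow> meta) {
        List<MetaRow> rebuilt = new ArrayList<>();
        for (MetaRow row : meta) {
            rebuilt.add(new MetaRow(row.regionName, null));
        }
        return rebuilt;
    }

    /** Models crash handling: only regions recorded on the dead server are candidates for reassignment. */
    static List<String> regionsOnServer(List<MetaRow> meta, String deadServer) {
        List<String> regions = new ArrayList<>();
        for (MetaRow row : meta) {
            if (deadServer.equals(row.server)) {
                regions.add(row.regionName);
            }
        }
        return regions;
    }

    public static void main(String[] args) {
        String deadServer = "rs1.example.com,16020,1"; // hypothetical server name
        List<MetaRow> meta = Arrays.asList(
                new MetaRow("hbase:namespace", deadServer),
                new MetaRow("ns1:t1", deadServer),
                new MetaRow("ns2:t1", deadServer));

        System.out.println("to reassign before rebuild: " + regionsOnServer(meta, deadServer));
        System.out.println("to reassign after rebuild:  "
                + regionsOnServer(rebuildOffline(meta), deadServer));
    }
}
```

In this model the lookup after the rebuild finds no regions for the dead server, so nothing (including the namespace table region) is queued for assignment, which matches the timeout reported in the issue.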