hbase-issues mailing list archives

From "Allan Yang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-18167) OfflineMetaRepair tool may cause HMaster abort always
Date Tue, 20 Jun 2017 01:50:00 GMT

    [ https://issues.apache.org/jira/browse/HBASE-18167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16055043#comment-16055043 ]

Allan Yang commented on HBASE-18167:

> In the current patch I am reusing the same configuration parameter "hbase.master.initializationmonitor.timeout"
> for the SSH wait timeout.
> I feel it's better we introduce another parameter for this.
> What do you say?

I don't think we need to introduce a new config; the existing one is enough. Another option is
to wait forever here, since there is already a timeout monitor.
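
To make the two options concrete, here is a minimal, self-contained sketch of a bounded wait versus an unbounded one. It is not taken from the patch: the class, `readTimeoutMs`, and `waitUntil` are hypothetical names, `java.util.Properties` stands in for the HBase configuration object, and treating a negative value as "wait forever" is a convention of this sketch only.

```java
import java.util.Properties;
import java.util.function.BooleanSupplier;

final class NamespaceAssignmentWait {
    // Configuration key quoted in the comment above.
    static final String TIMEOUT_KEY = "hbase.master.initializationmonitor.timeout";

    // Assumption of this sketch: an unset key yields -1, meaning "no deadline".
    static long readTimeoutMs(Properties conf) {
        return Long.parseLong(conf.getProperty(TIMEOUT_KEY, "-1"));
    }

    // Polls until done reports true. Returns false if the deadline passes
    // first or the thread is interrupted; a negative timeoutMs never expires.
    static boolean waitUntil(BooleanSupplier done, long timeoutMs, long pollMs) {
        long startNanos = System.nanoTime();
        while (!done.getAsBoolean()) {
            long elapsedMs = (System.nanoTime() - startNanos) / 1_000_000L;
            if (timeoutMs >= 0 && elapsedMs >= timeoutMs) {
                return false;
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(TIMEOUT_KEY, "200");

        // Condition becomes true after about 50 ms, well inside the deadline.
        long begin = System.currentTimeMillis();
        boolean assigned = waitUntil(
            () -> System.currentTimeMillis() - begin >= 50, readTimeoutMs(conf), 10);
        System.out.println("assigned before deadline: " + assigned);

        // Condition never becomes true, so the bounded wait gives up.
        boolean never = waitUntil(() -> false, readTimeoutMs(conf), 10);
        System.out.println("assigned when region never comes online: " + never);
    }
}
```

With a negative timeout the loop only ends when the condition holds, which is the "wait forever and let the existing timeout monitor abort the master" option.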

> OfflineMetaRepair tool may cause HMaster abort always
> -----------------------------------------------------
>                 Key: HBASE-18167
>                 URL: https://issues.apache.org/jira/browse/HBASE-18167
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.4.0, 1.3.1, 1.3.2
>            Reporter: Pankaj Kumar
>            Assignee: Pankaj Kumar
>            Priority: Critical
>             Fix For: 1.4.0
>         Attachments: HBASE-18167-branch-1.patch
> In the production environment, we hit a weird scenario where some meta table HFile blocks
> were missing for an unknown reason.
> To recover the environment we tried to rebuild meta using the OfflineMetaRepair tool
> and restart the cluster, but HMaster couldn't finish its initialization. It always timed
> out because the namespace table region was never assigned.
> Steps to reproduce
> ==================
> 1. Assign meta table region to HMaster (it can be on any RS, just to reproduce the  scenario)
> {noformat}
> 	<property>
>             <name>hbase.balancer.tablesOnMaster</name>
>             <value>hbase:meta</value>
>         </property>
> {noformat}
> 2. Start HMaster and RegionServer
> 3. Create two namespaces, say "ns1" & "ns2"
> 4. Create two tables "ns1:t1" & "ns2:t1"
> 5. flush 'hbase:meta'
> 6. Stop HMaster (graceful shutdown)
> 7. Kill -9 RegionServer (abnormal shutdown)
> 8. Run OfflineMetaRepair as follows,
> {noformat}
> 	hbase org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair -fix
> {noformat}
> 9. Restart HMaster and RegionServer
> 10. HMaster will never be able to finish its initialization and always aborts with the error below
> {code}
> 2017-06-06 15:11:07,582 FATAL [Hostname:16000.activeMasterManager] master.HMaster: Unhandled exception. Starting shutdown.
> java.io.IOException: Timedout 120000ms waiting for namespace table to be assigned
>         at org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:98)
>         at org.apache.hadoop.hbase.master.HMaster.initNamespace(HMaster.java:1054)
>         at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:848)
>         at org.apache.hadoop.hbase.master.HMaster.access$600(HMaster.java:199)
>         at org.apache.hadoop.hbase.master.HMaster$2.run(HMaster.java:1871)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> Root cause
> ==========
> 1. During HMaster startup, AM assumes that it's a failover scenario based on the existing
> old WAL files, so SSH/SCP will split the WAL files and assign the regions the dead server was holding.
> 2. During SSH/SCP it retrieves the regions held by that server from meta/AM's in-memory state,
> but meta only has the "regioninfo" entry (as it was already rebuilt by OfflineMetaRepair). So an empty
> region list is returned and no assignment is triggered.
> 3. HMaster, which is waiting for the namespace table to be assigned, times out and aborts.
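
The root cause above can be illustrated with a small model. This is a sketch under stated assumptions, not HBase code: `MetaRow`, `regionsOnServer`, the server name `rs1`, and the use of table names as region labels are all hypothetical. A meta row is reduced to a region name plus an optional server, and a rebuilt meta is modeled as rows that carry only the region info, so the server field is null.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MetaServerLookupSketch {
    // One meta row, reduced to the two fields that matter here.
    // server is null when the row holds only the "regioninfo" entry.
    static final class MetaRow {
        final String regionName;
        final String server;

        MetaRow(String regionName, String server) {
            this.regionName = regionName;
            this.server = server;
        }
    }

    // Regions that meta records as hosted on the crashed server.
    // Crash handling can only reassign what this lookup returns.
    static List<String> regionsOnServer(List<MetaRow> meta, String deadServer) {
        List<String> regions = new ArrayList<>();
        for (MetaRow row : meta) {
            if (deadServer.equals(row.server)) {
                regions.add(row.regionName);
            }
        }
        return regions;
    }

    public static void main(String[] args) {
        String deadServer = "rs1"; // hypothetical server name

        // Meta as it looked before the crash: every region has a server.
        List<MetaRow> original = Arrays.asList(
            new MetaRow("hbase:namespace", deadServer),
            new MetaRow("ns1:t1", deadServer),
            new MetaRow("ns2:t1", deadServer));

        // Meta after an offline rebuild: region info only, no server.
        List<MetaRow> rebuilt = Arrays.asList(
            new MetaRow("hbase:namespace", null),
            new MetaRow("ns1:t1", null),
            new MetaRow("ns2:t1", null));

        System.out.println("to reassign (original meta): "
            + regionsOnServer(original, deadServer));
        System.out.println("to reassign (rebuilt meta):  "
            + regionsOnServer(rebuilt, deadServer));
    }
}
```

In this model the lookup against the rebuilt meta returns no regions, so nothing is queued for assignment, and the namespace region stays offline until the master's wait times out.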

This message was sent by Atlassian JIRA
