hbase-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Josh Elser (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HBASE-19218) Master stuck thinking hbase:namespace is assigned after restart preventing intialization
Date Wed, 08 Nov 2017 19:46:00 GMT
Josh Elser created HBASE-19218:
----------------------------------

             Summary: Master stuck thinking hbase:namespace is assigned after restart preventing
intialization
                 Key: HBASE-19218
                 URL: https://issues.apache.org/jira/browse/HBASE-19218
             Project: HBase
          Issue Type: Bug
            Reporter: Josh Elser
            Priority: Critical
             Fix For: 2.0.0-beta-1


Our [~romil.choksi] brought this one to my attention after trying to get some cluster tests
running.

The Master seems to have gotten stuck never initializing after it thinks that hbase:namespace
was already deployed on the cluster when it actually was not. On a Master restart, it reads
the location out of meta and assumes that it's there (I assume this invalid entry is the issue):

{noformat}
2017-11-08 00:29:17,556 INFO  [ctr-e134-1499953498516-282290-01-000003:20000.masterManager]
assignment.RegionStateStore: Load hbase:meta entry region={ENCODED => f147f204a579b885c351bdc0a7ebbf94,
NAME => 'hbase:namespace,,1510084256045.f147f204a579b885c351bdc0a7ebbf94.', STARTKEY =>
'', ENDKEY => ''} regionState=OPENING lastHost=ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728
regionLocation=ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510100695534
{noformat}

Prior to this, the RS5 went through the ServerCrashProcedure, but it looks like this bailed
out unexpectedly:

{noformat}
2017-11-08 00:25:25,187 WARN  [ctr-e134-1499953498516-282290-01-000003:20000.masterManager]
master.ServerManager: Expiration of ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728
but server not online
2017-11-08 00:25:25,187 INFO  [ProcExecWrkr-5] procedure.ServerCrashProcedure: Start pid=36,
state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure server=ctr-e134-1499953498516-282290-01-000003.hwx.site,16020,1510084580111,
splitWal=t
rue, meta=false
2017-11-08 00:25:25,188 INFO  [ctr-e134-1499953498516-282290-01-000003:20000.masterManager]
master.ServerManager: Processing expiration of ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728
on ctr-e134-1499953498516-28
2290-01-000003.hwx.site,20000,1510100690324
...
2017-11-08 00:25:27,211 ERROR [ProcExecWrkr-22] procedure2.ProcedureExecutor: CODE-BUG: Uncaught
runtime exception: pid=40, ppid=37, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure
table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94
java.lang.NullPointerException
        at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:936)
        at org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher.addOperationToNode(RemoteProcedureDispatcher.java:171)
        at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.addToRemoteDispatcher(RegionTransitionProcedure.java:223)
        at org.apache.hadoop.hbase.master.assignment.AssignProcedure.updateTransition(AssignProcedure.java:252)
        at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:309)
        at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:82)
        at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:845)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1452)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1221)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:77)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1731)
2017-11-08 00:25:27,239 FATAL [ProcExecWrkr-22] procedure2.ProcedureExecutor: CODE-BUG: Uncaught
runtime exception for pid=37, state=FAILED:SERVER_CRASH_FINISH, exception=java.lang.NullPointerException
via CODE-BUG: Uncaught runtime exception: pid=40, ppid=37, state=RUNNABLE:REGION_TRANSITION_QUEUE;
AssignProcedure table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94:java.lang.NullPointerException;
ServerCrashProcedure server=ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728,
splitWal=true, meta=false
java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_FINISH
        at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:236)
        at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:56)
        at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:198)
        at org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:859)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1353)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1309)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1178)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:77)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1731)
2017-11-08 00:25:27,344 FATAL [ProcExecWrkr-8] procedure2.ProcedureExecutor: CODE-BUG: Uncaught
runtime exception for pid=37, state=FAILED:SERVER_CRASH_FINISH, exception=java.lang.NullPointerException
via CODE-BUG: Uncaught runtime exception: pid=40, ppid=37, state=RUNNABLE:REGION_TRANSITION_QUEUE;
AssignProcedure table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94:java.lang.NullPointerException;
ServerCrashProcedure server=ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728,
splitWal=true, meta=false
java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_FINISH
        at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:236)
        at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:56)
        at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:198)
        at org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:859)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1353)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1309)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1178)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:77)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1731)
2017-11-08 00:25:27,356 INFO  [ProcExecWrkr-5] procedure2.ProcedureExecutor: Rolled back pid=37,
state=ROLLEDBACK, exception=java.lang.NullPointerException via CODE-BUG: Uncaught runtime
exception: pid=40, ppid=37, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:namespace,
region=f147f204a579b885c351bdc0a7ebbf94:java.lang.NullPointerException; ServerCrashProcedure
server=ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728, splitWal=true,
meta=false exec-time=2.1650sec
{noformat}

Shortly after this, the master was restarted.

My hunch is that because the SCP crashed, we never invalidated the meta entries and got us
stuck thinking the region was assigned when it wasn't?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message