hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "stack (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-19218) Master stuck thinking hbase:namespace is assigned after restart preventing intialization
Date Wed, 08 Nov 2017 23:20:00 GMT

    [ https://issues.apache.org/jira/browse/HBASE-19218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244927#comment-16244927
] 

stack commented on HBASE-19218:
-------------------------------

FYI, if a CODE-BUG shows up, we've gone off the rails usually exhibited earlier up in the
log (NPEs after this emission are like NPEs after an OOME, symptoms, not a cause). Do we know
what this version means for hbase: HDP_VERSION=3.0.0.0-420? The line numbers are way off is
why I ask. Perhaps it an old branch-2 version? Looking at the issue, it looks like recovery
on top of another recovery. Usually this tends to work. Will wait on the version info before
doing more....

> Master stuck thinking hbase:namespace is assigned after restart preventing intialization
> ----------------------------------------------------------------------------------------
>
>                 Key: HBASE-19218
>                 URL: https://issues.apache.org/jira/browse/HBASE-19218
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Josh Elser
>            Priority: Critical
>             Fix For: 2.0.0-beta-1
>
>         Attachments: hbase-hbase-master-ctr-e134-1499953498516-282290-01-000003.hwx.site.log.zip,
hbase-site.xml
>
>
> Our [~romil.choksi] brought this one to my attention after trying to get some cluster
tests running.
> The Master seems to have gotten stuck never initializing after it thinks that hbase:namespace
was already deployed on the cluster when it actually was not. On a Master restart, it reads
the location out of meta and assumes that it's there (I assume this invalid entry is the issue):
> {noformat}
> 2017-11-08 00:29:17,556 INFO  [ctr-e134-1499953498516-282290-01-000003:20000.masterManager]
assignment.RegionStateStore: Load hbase:meta entry region={ENCODED => f147f204a579b885c351bdc0a7ebbf94,
NAME => 'hbase:namespace,,1510084256045.f147f204a579b885c351bdc0a7ebbf94.', STARTKEY =>
'', ENDKEY => ''} regionState=OPENING lastHost=ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728
regionLocation=ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510100695534
> {noformat}
> Prior to this, the RS5 went through the ServerCrashProcedure, but it looks like this
bailed out unexpectedly:
> {noformat}
> 2017-11-08 00:25:25,187 WARN  [ctr-e134-1499953498516-282290-01-000003:20000.masterManager]
master.ServerManager: Expiration of ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728
but server not online
> 2017-11-08 00:25:25,187 INFO  [ProcExecWrkr-5] procedure.ServerCrashProcedure: Start
pid=36, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure server=ctr-e134-1499953498516-282290-01-000003.hwx.site,16020,1510084580111,
splitWal=t
> rue, meta=false
> 2017-11-08 00:25:25,188 INFO  [ctr-e134-1499953498516-282290-01-000003:20000.masterManager]
master.ServerManager: Processing expiration of ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728
on ctr-e134-1499953498516-28
> 2290-01-000003.hwx.site,20000,1510100690324
> ...
> 2017-11-08 00:25:27,211 ERROR [ProcExecWrkr-22] procedure2.ProcedureExecutor: CODE-BUG:
Uncaught runtime exception: pid=40, ppid=37, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure
table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94
> java.lang.NullPointerException
>         at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:936)
>         at org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher.addOperationToNode(RemoteProcedureDispatcher.java:171)
>         at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.addToRemoteDispatcher(RegionTransitionProcedure.java:223)
>         at org.apache.hadoop.hbase.master.assignment.AssignProcedure.updateTransition(AssignProcedure.java:252)
>         at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:309)
>         at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:82)
>         at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:845)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1452)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1221)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:77)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1731)
> 2017-11-08 00:25:27,239 FATAL [ProcExecWrkr-22] procedure2.ProcedureExecutor: CODE-BUG:
Uncaught runtime exception for pid=37, state=FAILED:SERVER_CRASH_FINISH, exception=java.lang.NullPointerException
via CODE-BUG: Uncaught runtime exception: pid=40, ppid=37, state=RUNNABLE:REGION_TRANSITION_QUEUE;
AssignProcedure table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94:java.lang.NullPointerException;
ServerCrashProcedure server=ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728,
splitWal=true, meta=false
> java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_FINISH
>         at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:236)
>         at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:56)
>         at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:198)
>         at org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:859)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1353)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1309)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1178)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:77)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1731)
> 2017-11-08 00:25:27,344 FATAL [ProcExecWrkr-8] procedure2.ProcedureExecutor: CODE-BUG:
Uncaught runtime exception for pid=37, state=FAILED:SERVER_CRASH_FINISH, exception=java.lang.NullPointerException
via CODE-BUG: Uncaught runtime exception: pid=40, ppid=37, state=RUNNABLE:REGION_TRANSITION_QUEUE;
AssignProcedure table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94:java.lang.NullPointerException;
ServerCrashProcedure server=ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728,
splitWal=true, meta=false
> java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_FINISH
>         at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:236)
>         at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:56)
>         at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:198)
>         at org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:859)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1353)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1309)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1178)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:77)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1731)
> 2017-11-08 00:25:27,356 INFO  [ProcExecWrkr-5] procedure2.ProcedureExecutor: Rolled back
pid=37, state=ROLLEDBACK, exception=java.lang.NullPointerException via CODE-BUG: Uncaught
runtime exception: pid=40, ppid=37, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure
table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94:java.lang.NullPointerException;
ServerCrashProcedure server=ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728,
splitWal=true, meta=false exec-time=2.1650sec
> {noformat}
> Shortly after this, the master was restarted.
> My hunch is that because the SCP crashed, we never invalidated the meta entries and got
us stuck thinking the region was assigned when it wasn't?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message