hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Josh Elser (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-19218) Master stuck thinking hbase:namespace is assigned after restart preventing intialization
Date Thu, 09 Nov 2017 14:13:00 GMT

    [ https://issues.apache.org/jira/browse/HBASE-19218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245713#comment-16245713
] 

Josh Elser commented on HBASE-19218:
------------------------------------

bq. The line numbers are way off is why I ask. Perhaps it an old branch-2 version? Looking
at the issue, it looks like recovery on top of another recovery. Usually this tends to work.
Will wait on the version info before doing more....

Sorry for the confusion boss. We have an auto-merge process internally which nabs changes
from upstream and pulls them into a private repo. I had assumed this was a recent (last few
days build), but I guess this cluster has been hanging around for a week.

Thanks for giving the actual commit, Ted. To echo him -- that should be equivalent (I'm not
aware of anything of consequence that we have committed privately besides build-infra details).

> Master stuck thinking hbase:namespace is assigned after restart preventing intialization
> ----------------------------------------------------------------------------------------
>
>                 Key: HBASE-19218
>                 URL: https://issues.apache.org/jira/browse/HBASE-19218
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Josh Elser
>            Priority: Critical
>             Fix For: 2.0.0-beta-1
>
>         Attachments: hbase-hbase-master-ctr-e134-1499953498516-282290-01-000003.hwx.site.log.zip,
hbase-site.xml
>
>
> Our [~romil.choksi] brought this one to my attention after trying to get some cluster
tests running.
> The Master seems to have gotten stuck never initializing after it thinks that hbase:namespace
was already deployed on the cluster when it actually was not. On a Master restart, it reads
the location out of meta and assumes that it's there (I assume this invalid entry is the issue):
> {noformat}
> 2017-11-08 00:29:17,556 INFO  [ctr-e134-1499953498516-282290-01-000003:20000.masterManager]
assignment.RegionStateStore: Load hbase:meta entry region={ENCODED => f147f204a579b885c351bdc0a7ebbf94,
NAME => 'hbase:namespace,,1510084256045.f147f204a579b885c351bdc0a7ebbf94.', STARTKEY =>
'', ENDKEY => ''} regionState=OPENING lastHost=ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728
regionLocation=ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510100695534
> {noformat}
> Prior to this, the RS5 went through the ServerCrashProcedure, but it looks like this
bailed out unexpectedly:
> {noformat}
> 2017-11-08 00:25:25,187 WARN  [ctr-e134-1499953498516-282290-01-000003:20000.masterManager]
master.ServerManager: Expiration of ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728
but server not online
> 2017-11-08 00:25:25,187 INFO  [ProcExecWrkr-5] procedure.ServerCrashProcedure: Start
pid=36, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure server=ctr-e134-1499953498516-282290-01-000003.hwx.site,16020,1510084580111,
splitWal=t
> rue, meta=false
> 2017-11-08 00:25:25,188 INFO  [ctr-e134-1499953498516-282290-01-000003:20000.masterManager]
master.ServerManager: Processing expiration of ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728
on ctr-e134-1499953498516-28
> 2290-01-000003.hwx.site,20000,1510100690324
> ...
> 2017-11-08 00:25:27,211 ERROR [ProcExecWrkr-22] procedure2.ProcedureExecutor: CODE-BUG:
Uncaught runtime exception: pid=40, ppid=37, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure
table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94
> java.lang.NullPointerException
>         at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:936)
>         at org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher.addOperationToNode(RemoteProcedureDispatcher.java:171)
>         at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.addToRemoteDispatcher(RegionTransitionProcedure.java:223)
>         at org.apache.hadoop.hbase.master.assignment.AssignProcedure.updateTransition(AssignProcedure.java:252)
>         at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:309)
>         at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:82)
>         at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:845)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1452)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1221)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:77)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1731)
> 2017-11-08 00:25:27,239 FATAL [ProcExecWrkr-22] procedure2.ProcedureExecutor: CODE-BUG:
Uncaught runtime exception for pid=37, state=FAILED:SERVER_CRASH_FINISH, exception=java.lang.NullPointerException
via CODE-BUG: Uncaught runtime exception: pid=40, ppid=37, state=RUNNABLE:REGION_TRANSITION_QUEUE;
AssignProcedure table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94:java.lang.NullPointerException;
ServerCrashProcedure server=ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728,
splitWal=true, meta=false
> java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_FINISH
>         at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:236)
>         at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:56)
>         at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:198)
>         at org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:859)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1353)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1309)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1178)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:77)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1731)
> 2017-11-08 00:25:27,344 FATAL [ProcExecWrkr-8] procedure2.ProcedureExecutor: CODE-BUG:
Uncaught runtime exception for pid=37, state=FAILED:SERVER_CRASH_FINISH, exception=java.lang.NullPointerException
via CODE-BUG: Uncaught runtime exception: pid=40, ppid=37, state=RUNNABLE:REGION_TRANSITION_QUEUE;
AssignProcedure table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94:java.lang.NullPointerException;
ServerCrashProcedure server=ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728,
splitWal=true, meta=false
> java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_FINISH
>         at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:236)
>         at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:56)
>         at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:198)
>         at org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:859)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1353)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1309)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1178)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:77)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1731)
> 2017-11-08 00:25:27,356 INFO  [ProcExecWrkr-5] procedure2.ProcedureExecutor: Rolled back
pid=37, state=ROLLEDBACK, exception=java.lang.NullPointerException via CODE-BUG: Uncaught
runtime exception: pid=40, ppid=37, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure
table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94:java.lang.NullPointerException;
ServerCrashProcedure server=ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728,
splitWal=true, meta=false exec-time=2.1650sec
> {noformat}
> Shortly after this, the master was restarted.
> My hunch is that because the SCP crashed, we never invalidated the meta entries and got
us stuck thinking the region was assigned when it wasn't?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message