hbase-dev mailing list archives

From "Michael Stack (Jira)" <j...@apache.org>
Subject [jira] [Created] (HBASE-23247) [hbck2] Schedule SCPs for 'Unknown Servers'
Date Sun, 03 Nov 2019 01:08:00 GMT
Michael Stack created HBASE-23247:
-------------------------------------

             Summary: [hbck2] Schedule SCPs for 'Unknown Servers'
                 Key: HBASE-23247
                 URL: https://issues.apache.org/jira/browse/HBASE-23247
             Project: HBase
          Issue Type: Bug
          Components: hbck2
    Affects Versions: 2.2.2
            Reporter: Michael Stack
            Assignee: Michael Stack
             Fix For: 2.2.3


I've run into an 'Unknown Server' phenomenon: meta has regions assigned to servers that the
cluster no longer knows about. The fix is tough because a new assign insists on verifying the
close succeeded by contacting the 'unknown server', and it will not move on until it succeeds;
TODO. There are a few ways of arriving at this state of affairs; I'll list a couple below.

Meantime, an hbck2 'fix' should be as simple as scheduling an SCP using the scheduleRecoveries
command, only in this case it fails before scheduling the SCP with the below; i.e. a
FileNotFoundException because there is no WAL dir for the 'Unknown Server'.

{code}
 22:41:13.909 [main] INFO  org.apache.hadoop.hbase.client.ConnectionImplementation - Closing
master protocol: MasterService
 Exception in thread "main" java.io.IOException: org.apache.hbase.thirdparty.com.google.protobuf.ServiceException:
org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(java.io.FileNotFoundException): java.io.FileNotFoundException:
File hdfs://nameservice1/hbase/genie/WALs/s1.d.com,16020,1571170081872 does not exist.
   at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:986)
   at org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:122)
   at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1046)
   at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1043)
   at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
   at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1053)
   at org.apache.hadoop.fs.FilterFileSystem.listStatus(FilterFileSystem.java:258)
   at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1802)
   at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1844)
   at org.apache.hadoop.hbase.master.MasterRpcServices.containMetaWals(MasterRpcServices.java:2709)
   at org.apache.hadoop.hbase.master.MasterRpcServices.scheduleServerCrashProcedure(MasterRpcServices.java:2488)
   at org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$HbckService$2.callBlockingMethod(MasterProtos.java)
   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
   at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
   at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)

   at org.apache.hadoop.hbase.client.HBaseHbck.scheduleServerCrashProcedures(HBaseHbck.java:175)
   at org.apache.hadoop.hbase.client.Hbck.scheduleServerCrashProcedure(Hbck.java:118)
   at org.apache.hbase.HBCK2.scheduleRecoveries(HBCK2.java:345)
   at org.apache.hbase.HBCK2.doCommandLine(HBCK2.java:746)
   at org.apache.hbase.HBCK2.run(HBCK2.java:631)
   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
   at org.apache.hbase.HBCK2.main(HBCK2.java:865)
{code}

A simple fix makes it so I can schedule an SCP, which indeed clears out the 'Unknown Server'
and restores sanity on the cluster.
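For illustration, a sketch of the shape such a fix could take (an assumption, not the actual patch; it uses `java.nio.file` as a stand-in for the HDFS `FileSystem` calls in `MasterRpcServices#containMetaWals`, and the `WalProbeSketch` class name and `.meta` file-name suffix check are mine): treat a missing WAL directory as "no meta WALs" instead of letting the FileNotFoundException escape, so the SCP can still be scheduled.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class WalProbeSketch {
  // Sketch of the guard: a missing WAL dir for an 'Unknown Server' means its
  // logs were already split away, so report "no meta WALs" rather than letting
  // a FileNotFoundException abort scheduleRecoveries.
  static boolean containMetaWals(Path serverWalDir) {
    if (!Files.isDirectory(serverWalDir)) {
      return false; // nothing to split; safe to go ahead and schedule the SCP
    }
    try (DirectoryStream<Path> wals = Files.newDirectoryStream(serverWalDir)) {
      for (Path wal : wals) {
        // assumes meta WALs are identified by a ".meta" suffix in the file name
        if (wal.getFileName().toString().endsWith(".meta")) {
          return true;
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e); // propagate unexpected I/O errors
    }
    return false;
  }

  public static void main(String[] args) {
    // A dir that does not exist, like the Unknown Server's WALs dir above.
    Path gone = Paths.get("/tmp/no-such-wal-dir-hbase-23247");
    System.out.println(containMetaWals(gone)); // prints "false", no FNFE
  }
}
```

The point is only the guard up front; the real method would keep using the HDFS `FileSystem` API against the master's WAL root.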

As to how to get 'Unknown Server':

1. The current scenario came about because the exception below, hit while processing a server
crash procedure, made the SCP exit just after splitting logs but before it cleared old assigns.
A new server instance that came up after this one went down purged the server from the dead
servers list even though there were still Procedures in flight (the cluster was under crippling
overload).

{code}
 2019-11-02 21:02:34,775 DEBUG org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure:
Done splitting WALs pid=112532, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, locked=true; ServerCrashProcedure
server=s1.d.com,16020,1572668980355, splitWal=true, meta=false
 2019-11-02 21:02:34,775 DEBUG org.apache.hadoop.hbase.procedure2.RootProcedureState: Add
procedure pid=112532, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure
server=s1.d.com,16020,1572668980355, splitWal=true, meta=false as the 2th rollback step
 2019-11-02 21:02:34,779 INFO org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure:
pid=112532, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure server=s1.d.com,16020,1572668980355,
splitWal=true, meta=false found RIT pid=101251, ppid=101123, state=SUCCESS, bypass=LOG-REDACTED
TransitRegionStateProcedure                            table=GENIE2_modality_syncdata, region=fd2bd0f540756b8eba4c99301d2cf359,
ASSIGN; rit=OPENING, location=s1.d.com,16020,1572668980355, table=GENIE2_modality_syncdata,
region=fd2bd0f540756b8eba4c99301d2cf359
 2019-11-02 21:02:34,779 ERROR org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG:
Uncaught runtime exception: pid=112532, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure
server=s1.d.com,16020,1572668980355, splitWal=true, meta=false
 java.lang.NullPointerException
         at org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker.update(ProcedureStoreTracker.java:139)
         at org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker.update(ProcedureStoreTracker.java:132)
         at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.updateStoreTracker(WALProcedureStore.java:786)
         at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.pushData(WALProcedureStore.java:741)
         at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.update(WALProcedureStore.java:605)
         at org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.persistAndWake(RegionRemoteProcedureBase.java:183)
         at org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.serverCrashed(RegionRemoteProcedureBase.java:240)
         at org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.serverCrashed(TransitRegionStateProcedure.java:409)
         at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.assignRegions(ServerCrashProcedure.java:461)
         at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:221)
         at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:64)
         at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:194)
         at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:962)
         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1648)
         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1395)
         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1965)
 2019-11-02 21:02:34,779 DEBUG org.apache.hadoop.hbase.procedure2.RootProcedureState: Add
procedure pid=112532, state=FAILED:SERVER_CRASH_ASSIGN, locked=true, exception=java.lang.NullPointerException
via CODE-BUG: Uncaught runtime exception: pid=112532, state=RUNNABLE:SERVER_CRASH_ASSIGN,
locked=true; ServerCrashProcedure server=s1.d.com,16020,1572668980355, splitWal=true, meta=false:java.lang.NullPointerException;
ServerCrashProcedure server=s1.d.com,16020,1572668980355, splitWal=true, meta=false as the
3th rollback step
 2019-11-02 21:02:34,782 ERROR org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG:
Uncaught runtime exception for pid=112532, state=FAILED:SERVER_CRASH_ASSIGN, locked=true,
exception=java.lang.NullPointerException via CODE-BUG: Uncaught runtime exception: pid=112532,
state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure server=s1.d.com,16020,1572668980355,
splitWal=true, meta=false:java.lang.NullPointerException; ServerCrashProcedure server=s1.d.com,16020,1572668980355,
splitWal=true, meta=false
 java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_ASSIGN
         at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:333)
         at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:64)
         at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:219)
         at org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:979)
         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1569)
         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1501)
         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1352)
         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1965)
 2019-11-02 21:02:34,785 ERROR org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG:
Uncaught runtime exception for pid=112532, state=FAILED:SERVER_CRASH_ASSIGN, locked=true,
exception=java.lang.NullPointerException via CODE-BUG: Uncaught runtime exception: pid=112532,
state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure server=s1.d.com,16020,1572668980355,
splitWal=true, meta=false:java.lang.NullPointerException; ServerCrashProcedure server=s1.d.com,16020,1572668980355,
splitWal=true, meta=false
 java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_ASSIGN
         at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:333)
         at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:64)
         at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:219)
         at org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:979)
         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1569)
         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1501)
         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1352)
         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1965)
{code}

2. I'm pretty sure I ran into this when I cleared out the MasterProcWAL to start over fresh.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
