hbase-issues mailing list archives

From "Yi Liang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail
Date Thu, 16 Nov 2017 19:02:00 GMT

    [ https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16255788#comment-16255788 ]

Yi Liang commented on HBASE-19287:
----------------------------------

This happens when I restart the cluster; I have seen this error many times.

The RecoverMetaProcedure has a step that sends an AssignMetaRegion request to a target server.
The problem occurs when the request is sent out successfully but the target server then goes down.
{code}
try {
  final ExecuteProceduresResponse response = sendRequest(getServerName(), request.build());
  remoteCallCompleted(env, response);
} catch (IOException e) {
  e = unwrapException(e);
  // TODO: In the future some operation may want to bail out early.
  // TODO: How many times should we retry (use numberOfAttemptsSoFar)
  if (!scheduleForRetry(e)) {
    remoteCallFailed(env, e);
  }
}
{code}

Because the send itself succeeds, the code above throws no exception when the assign region request goes to the target server, so neither scheduleForRetry nor remoteCallFailed is called.


But there seems to be no timeout event that would retry the AssignProcedure or the RecoverMetaProcedure,
so the procedure hangs there forever.
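
To illustrate the kind of timeout that seems to be missing, here is a minimal, self-contained sketch. It is not HBase code and not a proposed patch: the class DispatchWatchdog and its methods are hypothetical names made up for this example. The idea is only that each dispatched request gets a deadline, and anything that has not been reported back by then is handed to the caller for retry.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical helper, NOT part of HBase: remembers when a remote request
 * was dispatched and reports which ones have waited longer than a timeout.
 */
class DispatchWatchdog {
  // Procedure id -> deadline (milliseconds, same clock as the caller's "now").
  private final Map<Long, Long> deadlines = new HashMap<>();
  private final long timeoutMillis;

  DispatchWatchdog(long timeoutMillis) {
    this.timeoutMillis = timeoutMillis;
  }

  /** Record that the request for procedure pid was sent at nowMillis. */
  synchronized void dispatched(long pid, long nowMillis) {
    deadlines.put(pid, nowMillis + timeoutMillis);
  }

  /** The target server reported back for pid; stop watching it. */
  synchronized void reported(long pid) {
    deadlines.remove(pid);
  }

  /**
   * Return the procedures whose deadline has passed and stop tracking them.
   * The caller would then retry them, possibly on a different server.
   */
  synchronized List<Long> expire(long nowMillis) {
    List<Long> timedOut = new ArrayList<>();
    Iterator<Map.Entry<Long, Long>> it = deadlines.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<Long, Long> entry = it.next();
      if (entry.getValue() <= nowMillis) {
        timedOut.add(entry.getKey());
        it.remove();
      }
    }
    return timedOut;
  }
}
```

In this sketch the caller would invoke dispatched(pid, now) right after the send succeeds, reported(pid) when the server reports the region transition, and expire(now) from a periodic thread. How that would map onto the existing procedure framework is an open question here.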

The log also shows the messages below; the server reported as stale is the target server of the RPC request above.
{quote}
RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] master.ServerManager: Triggering server recovery; existingServer hadoop-slave2.hadoop,16020,1510341988652 looks stale, new server:hadoop-slave2.hadoop,16020,1510342023184
2017-11-10 19:27:05,832 INFO  [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] master.ServerManager: Master doesn't enable ServerShutdownHandler during initialization, delay expiring server hadoop-slave2.hadoop,16020,1510341988652
{quote}

> master hangs forever if RecoverMeta send assign meta region request to target server fail
> -----------------------------------------------------------------------------------------
>
>                 Key: HBASE-19287
>                 URL: https://issues.apache.org/jira/browse/HBASE-19287
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Yi Liang
>
> 2017-11-10 19:26:56,019 INFO  [ProcExecWrkr-1] procedure.RecoverMetaProcedure: pid=138,
state=RUNNABLE:RECOVER_META_ASSIGN_REGIONS; RecoverMetaProcedure failedMetaServer=null, splitWal=true;
Retaining meta assignment to server=hadoop-slave1.hadoop,16020,1510341981454
> 2017-11-10 19:26:56,029 INFO  [ProcExecWrkr-1] procedure2.ProcedureExecutor: Initialized
subprocedures=[{pid=139, ppid=138, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure
table=hbase:meta, region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454}]
> 2017-11-10 19:26:56,067 INFO  [ProcExecWrkr-2] procedure.MasterProcedureScheduler: pid=139,
ppid=138, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, region=1588230740,
target=hadoop-slave1.hadoop,16020,1510341981454 hbase:meta hbase:meta,,1.1588230740
> 2017-11-10 19:26:56,071 INFO  [ProcExecWrkr-2] assignment.AssignProcedure: Start pid=139,
ppid=138, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, region=1588230740,
target=hadoop-slave1.hadoop,16020,1510341981454; rit=OFFLINE, location=hadoop-slave1.hadoop,16020,1510341981454;
forceNewPlan=false, retain=false
> 2017-11-10 19:26:56,224 INFO  [ProcExecWrkr-4] zookeeper.MetaTableLocator: Setting hbase:meta
(replicaId=0) location in ZooKeeper as hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:56,230 INFO  [ProcExecWrkr-4] assignment.RegionTransitionProcedure:
Dispatch pid=139, ppid=138, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:meta,
region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454; rit=OPENING, location=hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:56,382 INFO  [ProcedureDispatcherTimeoutThread] procedure.RSProcedureDispatcher:
Using procedure batch rpc execution for serverName=hadoop-slave2.hadoop,16020,1510341988652
version=2097152
> 2017-11-10 19:26:57,542 INFO  [main-EventThread] zookeeper.RegionServerTracker: RegionServer
ephemeral node deleted, processing expiration [hadoop-slave2.hadoop,16020,1510341988652]
> 2017-11-10 19:26:57,543 INFO  [main-EventThread] master.ServerManager: Master doesn't
enable ServerShutdownHandler during initialization, delay expiring server hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:58,875 INFO  [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000]
master.ServerManager: Registering server=hadoop-slave1.hadoop,16020,1510342016106
> 2017-11-10 19:27:05,832 INFO  [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000]
master.ServerManager: Registering server=hadoop-slave2.hadoop,16020,1510342023184
> 2017-11-10 19:27:05,832 INFO  [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000]
master.ServerManager: Triggering server recovery; existingServer hadoop-slave2.hadoop,16020,1510341988652
looks stale, new server:hadoop-slave2.hadoop,16020,1510342023184
> 2017-11-10 19:27:05,832 INFO  [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000]
master.ServerManager: Master doesn't enable ServerShutdownHandler during initialization, delay
expiring server hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:27:49,815 INFO  [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000]
client.RpcRetryingCallerImpl: tarted=38594 ms ago, cancelled=false, msg=org.apache.hadoop.hbase.NotServingRegionException:
hbase:meta,,1 is not online on hadoop-slave2.hadoop,16020,1510342023184
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3290)
>         at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1370)
>         at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2401)
>         at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:41544)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:406)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:278)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:258)
>  row 'hbase:namespace' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=hadoop-slave2.hadoop,16020,1510341988652,
seqNum=0



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
