hbase-issues mailing list archives

From "Ted Yu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-20919) meta region can't be re-onlined when restarting cluster if opening rsgroup
Date Thu, 26 Jul 2018 00:27:00 GMT

    [ https://issues.apache.org/jira/browse/HBASE-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16556448#comment-16556448 ]

Ted Yu commented on HBASE-20919:

Triggered QA again:

> meta region can't be re-onlined when restarting cluster if opening rsgroup
> --------------------------------------------------------------------------
>                 Key: HBASE-20919
>                 URL: https://issues.apache.org/jira/browse/HBASE-20919
>             Project: HBase
>          Issue Type: Bug
>          Components: Balancer, master, rsgroup
>    Affects Versions: 2.0.1
>            Reporter: chenyang
>            Assignee: ChenYang
>            Priority: Major
>         Attachments: HBASE-20919-branch-2.0-01.patch, HBASE-20919-branch-2.0-02.patch,
bug2.png, hbase-hbase-master-bjpg-rs4729.yz02.no_02patch.log, hbase-hbase-master-bjpg-rs4729.yz02.with_02patch.log,
> If you enable rsgroup, hbase-site.xml contains the configuration below.
> {code:java}
> <property>
>   <name>hbase.coprocessor.master.classes</name>
>   <value>org.apache.hadoop.hbase.rsgroup.RSGroupAdminEndpoint</value>
> </property>
> <property>
>   <name>hbase.master.loadbalancer.class</name>
>   <value>org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer</value>
> </property>
> {code}
> You then shut down the whole HBase cluster as follows:
>  # first shut down the region servers one by one
>  # then shut down the master
> Then you restart the whole cluster as follows:
>  # start the master
>  # start the region servers
> The hbase:meta region cannot come back online, and the rsgroup cannot be initialized successfully.
>  Master logs:
> {code:java}
> 2018-07-12 18:27:08,775 INFO [org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker-bjpg-rs4730.yz02,16000,1531389637409]
> upInfoManagerImpl$RSGroupStartupWorker: Waiting for catalog tables to come online
> 2018-07-12 18:27:08,876 INFO [org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker-bjpg-rs4730.yz02,16000,1531389637409]
> aTableLocator: Failed verification of hbase:meta,,1 at address=bjpg-rs4732.yz02,60020,1531388712053,
> on: hbase:meta,,1 is not online on bjpg-rs4732.yz02,60020,1531389727928
> at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3249)
> at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3226)
> at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1414)
> at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegionInfo(RSRpcServices.java:1729)
> at org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:28286)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
> {code}
> The logs show that the hbase:meta region is not online and that rsgroup keeps retrying its initialization.
>  But why is the hbase:meta region not online?
>  The info-level logs and the jstack output did not contain enough information, so I added some debug logs
in the test source code. I then checked the master's and the region server's logs and found
that the meta region assign procedure, which holds the meta region lock, never completed and never
released the lock, so the recoverMetaProcedure could not be executed.
>  Why did the first procedure not complete and not release the meta region lock?
>  In the test logs, I found that when the assignmentManager assigns the region, it needs to call
the rsgroup balancer, which has not been fully initialized, so it throws an NPE. As a result,
the procedure never completes and never releases the lock.
> {code:java}
> java.lang.NullPointerException
> at org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer.generateGroupMaps(RSGroupBasedLoadBalancer.java:262)
> at org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer.roundRobinAssignment(RSGroupBasedLoadBalancer.java:162)
> at org.apache.hadoop.hbase.master.assignment.AssignmentManager.processAssignmentPlans(AssignmentManager.java:1864)
> at org.apache.hadoop.hbase.master.assignment.AssignmentManager.processAssignQueue(AssignmentManager.java:1809)
> at org.apache.hadoop.hbase.master.assignment.AssignmentManager.access$400(AssignmentManager.java:113)
> at org.apache.hadoop.hbase.master.assignment.AssignmentManager$2.run(AssignmentManager.java:1693)
> {code}
> !bug2.png!
> As shown in the figure bug2.png listed in the attachments, when we shut down the last
region server, the master submits a ServerCrashProcedure. This procedure tries to reassign
the hbase:meta region, but at that moment there is no online region server, so the procedure
cannot run to completion. When we then shut down the master, the ServerCrashProcedure and its
subprocedures are stored in the procedureStore.
>  When we restart the master, it first blocks waiting to become the active master.
After becoming the active master, it starts the procedureExecutor. The procedureExecutor starts to
read procedures from the procedureStore, and the previous ServerCrashProcedure submits an assign-region
task to the assignmentManager's queue. The processQueue thread and the active-master thread both block
waiting for online region servers. When we start a region server, the active-master thread
performs some operations and initializes the rsgroup balancer. At the same time, the processQueue thread
starts to call the balancer. If the processQueue thread runs ahead of the active-master thread, the
processQueue thread throws an NPE. As a result, the procedure never completes and never releases the
hbase:meta region lock.
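
The ordering problem described above can be reduced to a small stand-alone sketch. It is not HBase code: the class UninitializedBalancerSketch, its field, and its methods are hypothetical stand-ins for a balancer whose group data is filled in by one thread while another thread may already be calling it. It only illustrates why a call that arrives before initialization ends in a NullPointerException.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical stand-in (not an HBase class) for a balancer whose state is
 * populated by a separate initialization step.
 */
final class UninitializedBalancerSketch {
    // Stays null until initialize() is called, like balancer state that is
    // only set up late during master startup.
    private List<String> groupServers;

    /** Assumed to be called by the thread that finishes master initialization. */
    void initialize(List<String> servers) {
        if (servers.isEmpty()) {
            throw new IllegalArgumentException("need at least one server");
        }
        groupServers = new ArrayList<>(servers);
    }

    /** Assumed to be called by the thread that processes the assign queue. */
    String pickServer(String region) {
        // No readiness check: dereferences null if initialize() has not run yet.
        int index = Math.floorMod(region.hashCode(), groupServers.size());
        return groupServers.get(index);
    }

    /** Returns true if calling the balancer before initialization throws an NPE. */
    static boolean failsBeforeInit() {
        UninitializedBalancerSketch balancer = new UninitializedBalancerSketch();
        try {
            balancer.pickServer("hbase:meta,,1");
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }
}
```

In this sketch the failure is deterministic because the call is made before initialize(); in the scenario reported here it depends on which thread runs first.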
>   For now, my solution is to initialize the balancer before calling startServiceThreads
in finishActiveMasterInitialization() of HMaster. But this may have some side effects for
>   Based on stack's suggestion, I re-submitted a new patch which waits for the rsgroup balancer
to be initialized before calling the balance methods.
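
The idea of waiting for initialization before calling the balancer could look roughly like the sketch below. It is an illustration under stated assumptions, not the contents of the attached patch: GuardedBalancerSketch, its methods, and the timeout parameter are hypothetical, and a java.util.concurrent.CountDownLatch is used here as one possible way to make the caller wait.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical stand-in (not an HBase class): callers wait until the balancer
 * has been initialized instead of dereferencing uninitialized state.
 */
final class GuardedBalancerSketch {
    private final CountDownLatch initialized = new CountDownLatch(1);
    private volatile List<String> groupServers; // null until initialize() runs

    /** Assumed to be called once by the thread that finishes master initialization. */
    void initialize(List<String> servers) {
        if (servers.isEmpty()) {
            throw new IllegalArgumentException("need at least one server");
        }
        groupServers = new ArrayList<>(servers);
        initialized.countDown(); // releases every thread blocked in pickServer()
    }

    /** Assumed to be called by the assign-queue thread; blocks up to timeoutMs. */
    String pickServer(String region, long timeoutMs) {
        try {
            if (!initialized.await(timeoutMs, TimeUnit.MILLISECONDS)) {
                throw new IllegalStateException(
                    "balancer not initialized after " + timeoutMs + " ms");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("interrupted while waiting for balancer", e);
        }
        List<String> servers = groupServers;
        return servers.get(Math.floorMod(region.hashCode(), servers.size()));
    }
}
```

With a guard like this, a caller that arrives early either waits until initialize() has run or fails with an explicit timeout that it can retry, rather than hitting an NPE.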

This message was sent by Atlassian JIRA
