hbase-dev mailing list archives

From "Enis Soztutar (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HBASE-12844) ServerManager.isServerReachable() should sleep between retries
Date Tue, 13 Jan 2015 03:01:34 GMT
Enis Soztutar created HBASE-12844:
-------------------------------------

             Summary: ServerManager.isServerReachable() should sleep between retries
                 Key: HBASE-12844
                 URL: https://issues.apache.org/jira/browse/HBASE-12844
             Project: HBase
          Issue Type: Bug
            Reporter: Enis Soztutar
            Assignee: Enis Soztutar
             Fix For: 1.0.0, 2.0.0, 1.1.0


There is a fundamental problem with the way the assignment manager and cluster membership work.
The root cause of most of the complexity, and of many bugs, is that we have multiple
"cluster membership" sources. This causes problems when they diverge from each other.

The master's in-memory ServerManager class keeps track of which servers are online and which
are considered dead. We have online and dead server lists in ServerManager, and a separate
dead server list in RegionStates.

There are at least 3 ways that a server can end up in a dead list. The first is the zookeeper
session. If a server loses its zk session, the master gets a notification and expires the server.
This is the regular way.

The second is calls to ServerManager.expireServer(). On the master this happens mostly when the
master rejoins the cluster. The master waits for some time for RSs to heartbeat, then expires all
the others and processes them as dead servers. This method has the potential to hijack the regions
on a region server without the region server knowing about it (and thus can cause multi-homing
of regions for reads, etc.).

The third is RegionStates calling ServerManager.isServerReachable() and, if the server is not
reachable, adding it to its own dead list, but not to the dead list of ServerManager.
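
To make the third path concrete, here is a toy model of how the two views can drift apart. It is only a sketch: the class, field, and method names below are hypothetical stand-ins, not the actual ServerManager or RegionStates code.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of two membership views; all names here are hypothetical. */
final class MembershipDivergence {
    // Stand-ins for ServerManager's online and dead lists.
    final Set<String> serverManagerOnline = new HashSet<>();
    final Set<String> serverManagerDead = new HashSet<>();
    // Stand-in for the separate dead list kept by RegionStates.
    final Set<String> regionStatesDead = new HashSet<>();

    /** A region server reports in and is recorded as online. */
    void serverStarted(String server) {
        serverManagerOnline.add(server);
    }

    /** Third path: a failed reachability check updates only the RegionStates view. */
    void pingFailed(String server) {
        regionStatesDead.add(server);
    }

    /** True when one view says "online" while the other says "dead". */
    boolean viewsDisagree(String server) {
        return serverManagerOnline.contains(server) && regionStatesDead.contains(server);
    }
}
```

Once pingFailed() has run for a server that is still in the online set, nothing in this model reconciles the two views, which is the kind of divergence described above.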

Obviously, here as in the region assignment case, we should fix the "state is kept
in multiple places" syndrome, but not in this issue (we already have HBASE-5487, etc. for that).


In this issue we should at least solve the following case: 

When a region server is starting up, it will throw exceptions when we try to ping it:
{code}
2015-01-10 00:23:10,369 DEBUG [AM.-pool1-t5] master.ServerManager: Couldn't reach os-enis-hbase-1.0-test-1.hw.com,16020,1420849386091, try=0 of 10
org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server is not running yet
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:886)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.getServerInfo(RSRpcServices.java:1155)
        at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:20886)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2028)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
        at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
        at java.lang.Thread.run(Thread.java:745)

        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
        at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
        at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:309)
        at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getServerInfo(ProtobufUtil.java:1794)
        at org.apache.hadoop.hbase.master.ServerManager.isServerReachable(ServerManager.java:810)
        at org.apache.hadoop.hbase.master.RegionStates.isServerDeadAndNotProcessed(RegionStates.java:756)
        at org.apache.hadoop.hbase.master.AssignmentManager.forceRegionStateToOffline(AssignmentManager.java:1952)
        at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1559)
        at org.apache.hadoop.hbase.master.AssignCallable.call(AssignCallable.java:48)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.ipc.ServerNotRunningYetException): org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server is not running yet
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:886)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.getServerInfo(RSRpcServices.java:1155)
        at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:20886)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2028)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
        at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
        at java.lang.Thread.run(Thread.java:745)

        at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1199)
        at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216)
        at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:300)
        at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.getServerInfo(AdminProtos.java:21819)
        at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getServerInfo(ProtobufUtil.java:1791)
        ... 9 more


....

2015-01-10 00:23:10,399 DEBUG [AM.-pool1-t5] master.ServerManager: Couldn't reach os-enis-hbase-1.0-test-1.hw.com,16020,1420849386091, try=9 of 10
org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server is not running yet
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:886)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.getServerInfo(RSRpcServices.java:1155)
        at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:20886)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2028)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
        at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
        at java.lang.Thread.run(Thread.java:745)

        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
        at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
        at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:309)
        at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getServerInfo(ProtobufUtil.java:1794)
        at org.apache.hadoop.hbase.master.ServerManager.isServerReachable(ServerManager.java:810)
        at org.apache.hadoop.hbase.master.RegionStates.isServerDeadAndNotProcessed(RegionStates.java:756)
        at org.apache.hadoop.hbase.master.AssignmentManager.forceRegionStateToOffline(AssignmentManager.java:1952)
        at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1559)
        at org.apache.hadoop.hbase.master.AssignCallable.call(AssignCallable.java:48)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.ipc.ServerNotRunningYetException): org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server is not running yet
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:886)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.getServerInfo(RSRpcServices.java:1155)
        at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:20886)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2028)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
        at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
        at java.lang.Thread.run(Thread.java:745)

        at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1199)
        at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216)
        at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:300)
        at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.getServerInfo(AdminProtos.java:21819)
        at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getServerInfo(ProtobufUtil.java:1791)
        ... 9 more


{code}

After 10 attempts within tens of milliseconds (as opposed to sleeping between retries),
the server is put in the dead servers list in RegionStates (but not in ServerManager's dead
servers list). As a result, the region server never receives a YouAreDeadException, and
the ServerManager thinks that the server is alive and well, while RegionStates thinks
that the RS is dead and does not assign regions to it:

{code}
2015-01-10 00:23:13,163 INFO  [os-enis-hbase-1.0-test-2.hw.com,16020,1420848162613-GeneralBulkAssigner-0] master.AssignmentManager: Assigning 2 region(s) to os-enis-hbase-1.0-test-1.hw.com,16020,1420849386091

2015-01-10 00:23:13,170 WARN  [os-enis-hbase-1.0-test-2.hw.com,16020,1420848162613-GeneralBulkAssigner-0] master.RegionStates: Couldn't reach online server os-enis-hbase-1.0-test-1.hw.com,16020,1420849386091
{code}

This also prevents unassign, etc., leaving the regions in transition forever (until the
admin kills the RS manually).

{code}
2015-01-10 00:23:13,188 INFO  [os-enis-hbase-1.0-test-2.hw.com,16020,1420848162613-GeneralBulkAssigner-0] master.AssignmentManager: Skip assigning loadtest_d1,cccccccc,1420849388510.15a752a6ad4b3a21c0d471483a225144., it is on a dead but not processed yet server: os-enis-hbase-1.0-test-1.hw.com,16020,1420849386091
{code}
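
As a rough illustration of the behavior this issue asks for, a retry loop that sleeps between attempts might look like the sketch below. This is not the actual patch or the real ServerManager code: the ReachabilityProbe class, the ping callable, and the maxAttempts and sleepMillis parameters are all hypothetical names chosen for the example.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper; not HBase's actual ServerManager code. */
final class ReachabilityProbe {
    private ReachabilityProbe() {}

    /**
     * Calls ping up to maxAttempts times, sleeping sleepMillis after each
     * failed attempt except the last. Returns true as soon as one ping
     * succeeds, false if every attempt fails or the thread is interrupted.
     */
    static boolean isReachable(Callable<Boolean> ping, int maxAttempts, long sleepMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                if (Boolean.TRUE.equals(ping.call())) {
                    return true;
                }
            } catch (Exception e) {
                if (e instanceof InterruptedException) {
                    // Restore the interrupt flag and give up.
                    Thread.currentThread().interrupt();
                    return false;
                }
                // Otherwise assume a transient failure, such as a server
                // that is still starting up, and fall through to the sleep.
            }
            if (attempt < maxAttempts - 1) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }
}
```

With 10 attempts and a sleep of, say, 100 ms, a starting region server would get about a second to finish initializing instead of the roughly 30 ms seen in the log above. The specific values are assumptions and would need tuning.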






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)