hbase-issues mailing list archives

From "Enis Soztutar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-13605) RegionStates should not keep its list of dead servers
Date Thu, 18 Jun 2015 23:17:01 GMT

    [ https://issues.apache.org/jira/browse/HBASE-13605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592707#comment-14592707
] 

Enis Soztutar commented on HBASE-13605:
---------------------------------------

I have reverted the patch from branch-1.1 for now, and opened HBASE-13937 for 1.1.1. The reason
is that the 13605 v4 patch uncovers one more issue, which was conveniently
masked by the current server ping mechanism in the RS.

The short story is that a region is assigned before WAL recovery finishes. An RS aborts, and
before the zk session expiry happens, we try to move the region from the master. The unassign RPC
fails with {{RegionServerNotRunningYetException}} on a new instance of the server, and the
AM concludes that it can put the region in OFFLINE state. The next assignment then goes through
before the actual RS session expiration has even happened. Obviously the problem is not with
the 13605_v4 patch (I think), but with the assumption that the AM can just OFFLINE the region
and assign it again if the unassign RPC fails for some reason. We cannot blindly assume that if the
unassign RPC fails with RegionServerStoppedException, RegionServerNotRunningYetException, etc., the
region can be assigned now, and then rely on the dead servers list for recovery checking.
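
To make the point concrete, here is a minimal sketch of a more conservative decision after a failed unassign RPC. It is not HBase code and not the fix proposed in HBASE-13937: the class, the enums, the {{decide}} method, the "recovered servers" set, and the server name are all hypothetical. The assumption is that the master tracks which server instances have had their expiry processed and their WAL recovery finished, and only then treats the region as safe to OFFLINE and reassign.

```java
import java.util.Set;

// Hypothetical sketch: none of these names are HBase APIs.
class UnassignFailureSketch {

    // Failure kinds named in the comment above, plus a catch-all.
    enum UnassignFailure { REGION_SERVER_STOPPED, REGION_SERVER_NOT_RUNNING_YET, OTHER }

    // What the master does with the region after a failed unassign RPC.
    enum NextStep { OFFLINE_AND_REASSIGN, WAIT_FOR_SERVER_RECOVERY, RETRY_UNASSIGN }

    // recoveredServers is an assumed set of server instances (host,port,startcode)
    // whose expiry has been processed and whose WAL recovery has finished.
    static NextStep decide(UnassignFailure failure, String hostingServer,
                           Set<String> recoveredServers) {
        if (recoveredServers.contains(hostingServer)) {
            // Recovery is complete, so the region's edits are safe to reopen elsewhere.
            return NextStep.OFFLINE_AND_REASSIGN;
        }
        switch (failure) {
            case REGION_SERVER_STOPPED:
            case REGION_SERVER_NOT_RUNNING_YET:
                // The old instance is likely gone, but its recovery has not run yet:
                // do not reassign on the strength of the exception alone.
                return NextStep.WAIT_FOR_SERVER_RECOVERY;
            default:
                return NextStep.RETRY_UNASSIGN;
        }
    }

    public static void main(String[] args) {
        String oldInstance = "rs-example,16020,1"; // made-up server name
        System.out.println(decide(UnassignFailure.REGION_SERVER_NOT_RUNNING_YET,
                oldInstance, Set.of()));
        System.out.println(decide(UnassignFailure.REGION_SERVER_NOT_RUNNING_YET,
                oldInstance, Set.of(oldInstance)));
    }
}
```

The design choice in the sketch is that the exception type never authorizes reassignment by itself; only completed recovery of the hosting server instance does.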

Anyway, I don't think I will spend any more time on this issue without a proper re-design
of AM / RS. My suggestion is to do HBASE-13937 for 1.1.1. 

> RegionStates should not keep its list of dead servers
> -----------------------------------------------------
>
>                 Key: HBASE-13605
>                 URL: https://issues.apache.org/jira/browse/HBASE-13605
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>            Priority: Critical
>             Fix For: 2.0.0, 1.0.2, 1.1.1
>
>         Attachments: hbase-13605_v1.patch, hbase-13605_v3-branch-1.1.patch, hbase-13605_v4-branch-1.1.patch,
hbase-13605_v4-master.patch
>
>
> As mentioned in https://issues.apache.org/jira/browse/HBASE-9514?focusedCommentId=13769761&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13769761
and HBASE-12844 we should have only 1 source of cluster membership. 
> The list of dead servers and RegionStates doing its own liveness check (ServerManager.isServerReachable())
have caused an assignment problem again in a test cluster, where the region states "thinks"
that the server is dead and SSH will handle the region assignment. However, the RS is not dead
at all, living happily, and never gets zk expiry or YouAreDeadException or anything. This
leaves the list of regions unassigned in OFFLINE state. 
> Master assigning the region:
> {code}
> 15-04-20 09:02:25,780 DEBUG [AM.ZK.Worker-pool3-t330] master.RegionStates: Onlined 77dddcd50c22e56bfff133c0e1f9165b
on os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268 {ENCODED => 77dddcd50c
> {code}
> Master then disabled the table, and unassigned the region:
> {code}
> 2015-04-20 09:02:27,158 WARN  [ProcedureExecutorThread-1] zookeeper.ZKTableStateManager:
Moving table loadtest_d1 state from DISABLING to DISABLING
>  Starting unassign of loadtest_d1,,1429520544378.77dddcd50c22e56bfff133c0e1f9165b. (offlining),
current state: {77dddcd50c22e56bfff133c0e1f9165b state=OPEN, ts=1429520545780,   server=os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268}
> bleProcedure$BulkDisabler-0] master.AssignmentManager: Sent CLOSE to os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268
for region loadtest_d1,,1429520544378.77dddcd50c22e56bfff133c0e1f9165b.
> 2015-04-20 09:02:27,414 INFO  [AM.ZK.Worker-pool3-t316] master.RegionStates: Offlined
77dddcd50c22e56bfff133c0e1f9165b from os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268
> {code}
> On table re-enable, AM does not assign the region: 
> {code}
> 2015-04-20 09:02:30,415 INFO  [ProcedureExecutorThread-3] balancer.BaseLoadBalancer:
Reassigned 25 regions. 25 retained the pre-restart assignment.
> 2015-04-20 09:02:30,415 INFO  [ProcedureExecutorThread-3] procedure.EnableTableProcedure:
Bulk assigning 25 region(s) across 5 server(s), retainAssignment=true
> l,16000,1429515659726-GeneralBulkAssigner-4] master.RegionStates: Couldn't reach online
server os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268
> l,16000,1429515659726-GeneralBulkAssigner-4] master.AssignmentManager: Updating the state
to OFFLINE to allow to be reassigned by SSH
> nmentManager: Skip assigning loadtest_d1,,1429520544378.77dddcd50c22e56bfff133c0e1f9165b.,
it is on a dead but not processed yet server: os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268
> {code}
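
The divergence described in the issue can be illustrated with a small model. This is a sketch under assumptions, not the actual RegionStates or ServerManager code: {{Membership}}, {{RegionStatesWithOwnDeadList}}, {{RegionStatesDelegating}}, their methods, and the server name are hypothetical stand-ins. It contrasts a component that keeps a private dead-server list fed by its own reachability probe with one that asks the single membership source.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model: these classes are illustrative, not HBase classes.
class MembershipSketch {

    // Stand-in for the one authoritative view of cluster membership.
    static final class Membership {
        private final Set<String> online = new HashSet<>();
        void serverJoined(String server) { online.add(server); }
        void sessionExpired(String server) { online.remove(server); }
        boolean isOnline(String server) { return online.contains(server); }
    }

    // Two-source variant: a private dead list that only a probe failure updates.
    static final class RegionStatesWithOwnDeadList {
        private final Set<String> deadServers = new HashSet<>();
        void probeFailed(String server) { deadServers.add(server); }
        boolean canAssignTo(String server) { return !deadServers.contains(server); }
    }

    // Single-source variant: no private state, every answer comes from Membership.
    static final class RegionStatesDelegating {
        private final Membership membership;
        RegionStatesDelegating(Membership membership) { this.membership = membership; }
        boolean canAssignTo(String server) { return membership.isOnline(server); }
    }

    public static void main(String[] args) {
        String rs = "rs-example,16020,1"; // made-up server name
        Membership membership = new Membership();
        membership.serverJoined(rs);

        RegionStatesWithOwnDeadList twoSource = new RegionStatesWithOwnDeadList();
        RegionStatesDelegating singleSource = new RegionStatesDelegating(membership);

        // One transient probe failure; the server's session never expires.
        twoSource.probeFailed(rs);

        System.out.println("two-source can assign: " + twoSource.canAssignTo(rs));
        System.out.println("single-source can assign: " + singleSource.canAssignTo(rs));
    }
}
```

In the two-source variant nothing ever clears the private entry, because the event that would normally trigger cleanup (session expiry) never happens for a live server. That mirrors the stuck OFFLINE regions in the logs above.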



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
