hbase-issues mailing list archives

From "Andrew Purtell (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-19144) [RSgroups] Regions assigned to a RSGroup all go to FAILED_OPEN state when all servers in the group are unavailable
Date Tue, 31 Oct 2017 22:33:00 GMT

    [ https://issues.apache.org/jira/browse/HBASE-19144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16227713#comment-16227713 ]

Andrew Purtell commented on HBASE-19144:
----------------------------------------

bq. It would be better if the RSGroupInfoManager could automatically kick reassignments of
regions in FAILED_OPEN state when servers rejoin the cluster. 

I threw together a hack (imagine this done with another registered ServerListener) that addresses
the problem as reported: regions left in FAILED_OPEN state due to constraint failures while a
whole RSGroup is down are all reassigned. I am not sure this is the best way, though:

{code}
diff --git a/hbase-rsgroup/src/main/java/org/apache/hadoop/hbase/rsgroup/RSGroupInfoManagerImpl.java b/hbase-rsgroup/src/main/java/org/apache/hadoop/hbase/rsgroup/RSGroupInfoManagerImpl.java
index 80eaefb036..d6a7ec120d 100644
--- a/hbase-rsgroup/src/main/java/org/apache/hadoop/hbase/rsgroup/RSGroupInfoManagerImpl.java
+++ b/hbase-rsgroup/src/main/java/org/apache/hadoop/hbase/rsgroup/RSGroupInfoManagerImpl.java
@@ -531,6 +535,19 @@ public class RSGroupInfoManagerImpl implements RSGroupInfoManager, ServerListene
             prevDefaultServers = servers;
             LOG.info("Updated with servers: "+servers.size());
           }
+
+          // Kick assignments that may be in FAILED_OPEN state
+          List<HRegionInfo> failedAssignments = Lists.newArrayList();
+          for (RegionState state:
+              mgr.master.getAssignmentManager().getRegionStates().getRegionsInTransition()) {
+            if (state.isFailedOpen()) {
+              failedAssignments.add(state.getRegion());
+            }
+          }
+          for (HRegionInfo region: failedAssignments) {
+            mgr.master.getAssignmentManager().unassign(region);
+          }
+
           try {
             synchronized (this) {
               if(!hasChanged) {
{code}


Testing was with branch-1.4 / branch-1. I also need to check how branch-2 behaves. 
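
To make the "another registered ServerListener" idea concrete, here is a minimal, self-contained sketch of the shape such a listener might take. It deliberately does not use HBase classes: {{State}}, {{ServerJoinListener}} and {{FailedOpenKicker}} are hypothetical stand-ins, and flipping a region back to OFFLINE is only a placeholder for whatever the assignment manager call ends up being. It illustrates the collect-then-kick pattern from the patch, not a tested fix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a region's assignment state (not an HBase type).
enum State { OPEN, OFFLINE, FAILED_OPEN }

// Hypothetical stand-in for a listener notified when a server joins.
interface ServerJoinListener {
  void serverAdded(String serverName);
}

// On every server join, re-kick any region currently in FAILED_OPEN.
class FailedOpenKicker implements ServerJoinListener {
  private final Map<String, State> regionStates; // region name -> state
  private final List<String> kicked = new ArrayList<>();

  FailedOpenKicker(Map<String, State> regionStates) {
    this.regionStates = regionStates;
  }

  @Override
  public void serverAdded(String serverName) {
    // First pass: collect the failed regions, as the patch does.
    List<String> failed = new ArrayList<>();
    for (Map.Entry<String, State> e : regionStates.entrySet()) {
      if (e.getValue() == State.FAILED_OPEN) {
        failed.add(e.getKey());
      }
    }
    // Second pass: kick each one. Setting OFFLINE here is an assumption
    // standing in for the real reassignment call.
    for (String region : failed) {
      regionStates.put(region, State.OFFLINE);
      kicked.add(region);
    }
  }

  // Regions kicked so far, in the order they were processed.
  List<String> kicked() {
    return kicked;
  }
}
```

A real version would presumably also want to restrict the kick to regions whose RSGroup actually regained a server, rather than retrying every FAILED_OPEN region on every join.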

> [RSgroups] Regions assigned to a RSGroup all go to FAILED_OPEN state when all servers in the group are unavailable
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-19144
>                 URL: https://issues.apache.org/jira/browse/HBASE-19144
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.0.0, 3.0.0, 1.4.0, 1.5.0
>            Reporter: Andrew Purtell
>
> After all servers in the RSgroup are down the regions cannot be opened anywhere and transition
> rapidly into FAILED_OPEN state.
>  
> 2017-10-31 21:06:25,449 INFO [ProcedureExecutor-13] master.RegionStates: Transition {c6c8150c9f4b8df25ba32073f25a5143 state=OFFLINE, ts=1509483985448, server=node-5.cluster,16020,1509482700768} to {c6c8150c9f4b8df25ba32073f25a5143 state=FAILED_OPEN, ts=1509483985449, server=node-5.cluster,16020,1509482700768}
> 2017-10-31 21:06:25,449 WARN [ProcedureExecutor-13] master.RegionStates: Failed to open/close d4e2f173e31ffad6aac140f4bd7b02bc on node-5.cluster,16020,1509482700768, set to FAILED_OPEN
>  
> Any region in FAILED_OPEN state has to be manually reassigned, or the master can be restarted
> and this will also cause reattempt of assignment of any regions in FAILED_OPEN state. This
> is not unexpected but is an operational headache. It would be better if the RSGroupInfoManager
> could automatically kick reassignments of regions in FAILED_OPEN state when servers rejoin
> the cluster.



