Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Date: Thu, 2 Nov 2017 18:42:00 +0000 (UTC)
From: "churro morales (JIRA)" <jira@apache.org>
To: issues@hbase.apache.org
Message-ID: <JIRA.13113437.1509488790000.134892.1509648120879@Atlassian.JIRA>
In-Reply-To: <JIRA.13113437.1509488790000@Atlassian.JIRA>
References: <JIRA.13113437.1509488790000@Atlassian.JIRA> <JIRA.13113437.1509488790042@jira-lw-us.apache.org>
Subject: [jira] [Comment Edited] (HBASE-19144) [RSgroups] Retry assignments
 in FAILED_OPEN state when servers (re)join the cluster
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Thu, 02 Nov 2017 18:42:07 -0000


    [ https://issues.apache.org/jira/browse/HBASE-19144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236362#comment-16236362 ] 

churro morales edited comment on HBASE-19144 at 11/2/17 6:41 PM:
-----------------------------------------------------------------

lgtm, 

Should we first check that at least one server is up? The master patch looks good because we have a serverAdded. In the branch-1.4 patch it looks like we only have serverChanged.

So in this case, could we kick this off when we don't have any servers from our group up yet?
We could just wait for at least one server to come up, but I don't know if that's overkill.

something like
{code}
while (Collections.disjoint(masterServices.getServerManager().getOnlineServersList(), RSGroupInfoManagerImpl.this.getDefaultServers())) {
  // wait: no server returned by getDefaultServers() is online yet
}
{code}

This might be overkill, and it looks unnecessary on the master branch; only branch-1.4 would need it.
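
To make the wait idea above concrete, here is a minimal standalone sketch of a bounded poll built on Collections.disjoint. It is not taken from either patch: the class GroupServerWait, the method waitForGroupServer, the timeout and poll parameters, and the use of plain strings for server names are all assumptions made for illustration. In HBase the two collections would come from the server manager and the RSGroupInfoManager instead.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.function.Supplier;

// Hypothetical helper, not part of HBase: illustrates a bounded wait only.
public class GroupServerWait {

  /**
   * Polls until at least one online server belongs to the group, or the
   * timeout elapses. Returns true if a group server was seen online.
   */
  public static boolean waitForGroupServer(Supplier<Collection<String>> onlineServers,
      Collection<String> groupServers, long timeoutMs, long pollMs) {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    // disjoint() is true while the two collections share no element.
    while (Collections.disjoint(onlineServers.get(), groupServers)) {
      if (System.nanoTime() - deadline >= 0) {
        return false; // gave up: no group server came online in time
      }
      try {
        Thread.sleep(pollMs); // back off instead of spinning
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt flag
        return false;
      }
    }
    return true;
  }
}
```

Compared with the bare loop, the timeout keeps the caller from blocking forever if the group never gets a server, and the sleep avoids a busy spin. Whether that extra machinery is worth it on branch-1.4 is the same open question as above.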


> [RSgroups] Retry assignments in FAILED_OPEN state when servers (re)join the cluster
> -----------------------------------------------------------------------------------
>
>                 Key: HBASE-19144
>                 URL: https://issues.apache.org/jira/browse/HBASE-19144
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Andrew Purtell
>            Assignee: Andrew Purtell
>            Priority: Major
>             Fix For: 2.0.0, 3.0.0, 1.4.0, 1.5.0
>
>         Attachments: HBASE-19144-branch-1.patch, HBASE-19144.patch
>
>
> After all servers in the RSgroup are down the regions cannot be opened anywhere and transition rapidly into FAILED_OPEN state.
>  
> 2017-10-31 21:06:25,449 INFO [ProcedureExecutor-13] master.RegionStates: Transition {c6c8150c9f4b8df25ba32073f25a5143 state=OFFLINE, ts=1509483985448, server=node-5.cluster,16020,1509482700768} to {c6c8150c9f4b8df25ba32073f25a5143 state=FAILED_OPEN, ts=1509483985449, server=node-5.cluster,16020,1509482700768}
> 2017-10-31 21:06:25,449 WARN [ProcedureExecutor-13] master.RegionStates: Failed to open/close d4e2f173e31ffad6aac140f4bd7b02bc on node-5.cluster,16020,1509482700768, set to FAILED_OPEN
>  
> Any region in FAILED_OPEN state has to be manually reassigned; restarting the master also causes assignment to be reattempted for any regions in FAILED_OPEN state. This is not unexpected but is an operational headache. It would be better if the RSGroupInfoManager could automatically kick off reassignment of regions in FAILED_OPEN state when servers rejoin the cluster.

