hbase-issues mailing list archives

From "Andrew Purtell (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HBASE-19144) [RSgroups] Retry assignments in FAILED_OPEN state when servers (re)join the cluster
Date Thu, 02 Nov 2017 18:54:01 GMT

    [ https://issues.apache.org/jira/browse/HBASE-19144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236390#comment-16236390 ]

Andrew Purtell edited comment on HBASE-19144 at 11/2/17 6:53 PM:
-----------------------------------------------------------------

bq. Should we first check there is at least one server up

You mean, for each rsgroup, check that at least one server is up before assigning the region?
For each region we would have to look up its group, then liveness-check the registered servers.
Overkill, I'd say. If the FAILED_OPEN assignment fails again, that's ok. Either it failed because
we can't fix the problem yet, or it failed because of a permanent condition unrelated to RSGroups,
and the operator will have to intervene, as was the case prior to this change.


was (Author: apurtell):
bq. Should we first check there is at least one server up

You mean for each rsgroup check if at least one server is up before assigning the region?
Overkill, I'd say. If the FAILED_OPEN assignment fails again that's ok. Either it failed because
we can't fix the problem yet, or it failed because of a permanent condition unrelated to RSGroups,
and the operator will have to intervene as prior to this change. 

> [RSgroups] Retry assignments in FAILED_OPEN state when servers (re)join the cluster
> -----------------------------------------------------------------------------------
>
>                 Key: HBASE-19144
>                 URL: https://issues.apache.org/jira/browse/HBASE-19144
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Andrew Purtell
>            Assignee: Andrew Purtell
>            Priority: Major
>             Fix For: 2.0.0, 3.0.0, 1.4.0, 1.5.0
>
>         Attachments: HBASE-19144-branch-1.patch, HBASE-19144.patch
>
>
> After all servers in the RSgroup are down the regions cannot be opened anywhere and transition
> rapidly into FAILED_OPEN state.
>  
> 2017-10-31 21:06:25,449 INFO [ProcedureExecutor-13] master.RegionStates: Transition {c6c8150c9f4b8df25ba32073f25a5143 state=OFFLINE, ts=1509483985448, server=node-5.cluster,16020,1509482700768} to {c6c8150c9f4b8df25ba32073f25a5143 state=FAILED_OPEN, ts=1509483985449, server=node-5.cluster,16020,1509482700768}
> 2017-10-31 21:06:25,449 WARN [ProcedureExecutor-13] master.RegionStates: Failed to open/close d4e2f173e31ffad6aac140f4bd7b02bc on node-5.cluster,16020,1509482700768, set to FAILED_OPEN
>  
> Any region in FAILED_OPEN state has to be manually reassigned, or the master can be restarted
> and this will also cause reattempt of assignment of any regions in FAILED_OPEN state. This
> is not unexpected but is an operational headache. It would be better if the RSGroupInfoManager
> could automatically kick reassignments of regions in FAILED_OPEN state when servers rejoin
> the cluster.
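
The behaviour proposed above can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the attached patch: FailedOpenRetrySketch and every method on it are hypothetical names invented for illustration, and none of them are HBase classes or APIs. When a server joins, the model retries every FAILED_OPEN region without first checking per group whether a server is up; a retry that fails again simply leaves the region in FAILED_OPEN.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical toy model; none of these names are HBase classes or APIs. */
class FailedOpenRetrySketch {
    enum State { OPEN, FAILED_OPEN }

    private final Map<String, State> regionStates = new LinkedHashMap<>();       // region -> state
    private final Map<String, String> regionGroup = new LinkedHashMap<>();       // region -> rsgroup
    private final Map<String, Set<String>> groupServers = new LinkedHashMap<>(); // rsgroup -> registered servers
    private final Set<String> liveServers = new HashSet<>();

    /** Registers a server as a member of a group; it is not live until it joins. */
    void registerServer(String group, String server) {
        groupServers.computeIfAbsent(group, g -> new LinkedHashSet<>()).add(server);
    }

    /** Adds a region to a group and attempts an initial assignment. */
    void addRegion(String region, String group) {
        regionGroup.put(region, group);
        assign(region);
    }

    /** Opens the region if any server of its group is live, else marks it FAILED_OPEN. */
    boolean assign(String region) {
        Set<String> servers =
            groupServers.getOrDefault(regionGroup.get(region), Collections.<String>emptySet());
        for (String server : servers) {
            if (liveServers.contains(server)) {
                regionStates.put(region, State.OPEN);
                return true;
            }
        }
        regionStates.put(region, State.FAILED_OPEN);
        return false;
    }

    /**
     * Called when a server (re)joins. Retries every FAILED_OPEN region with no
     * per-group pre-check; a region that fails again stays FAILED_OPEN.
     * Returns the regions that opened on this retry.
     */
    List<String> serverJoined(String server) {
        liveServers.add(server);
        // Snapshot the candidates first so the state map is not modified while iterating it.
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, State> e : regionStates.entrySet()) {
            if (e.getValue() == State.FAILED_OPEN) {
                failed.add(e.getKey());
            }
        }
        List<String> reopened = new ArrayList<>();
        for (String region : failed) {
            if (assign(region)) {
                reopened.add(region);
            }
        }
        return reopened;
    }

    State stateOf(String region) {
        return regionStates.get(region);
    }
}
```

In this model, a region whose group still has no live server is retried and lands back in FAILED_OPEN, which matches the view in the comment that a repeated failure is acceptable and cheaper than a liveness check per group.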



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
