hbase-issues mailing list archives

From "Andrew Purtell (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-19144) [RSgroups] Regions assigned to a RSGroup all go to FAILED_OPEN state when all servers in the group are unavailable
Date Wed, 01 Nov 2017 18:50:00 GMT

    [ https://issues.apache.org/jira/browse/HBASE-19144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16234564#comment-16234564 ]

Andrew Purtell commented on HBASE-19144:

It's debatable whether kicking assignments in FAILED_OPEN state is always the right thing to
do in general when servers join the cluster. With RSgroups, however, I think it makes sense,
so I put the logic to do this into the RSGroups master extension.
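
As a rough illustration of that behavior (not the code in the attached patch), here is a minimal Python sketch. It assumes a toy model in which each region belongs to one group; `Region`, `ToyMaster`, and `server_added` are hypothetical names invented for this example and are not HBase APIs. When a server joins a group, only that group's FAILED_OPEN regions are retried.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

FAILED_OPEN = "FAILED_OPEN"
OPEN = "OPEN"


@dataclass
class Region:
    name: str
    group: str                    # RSGroup the region's table belongs to (toy model)
    state: str = OPEN
    server: Optional[str] = None  # server currently hosting the region, if any


@dataclass
class ToyMaster:
    # Hypothetical stand-in for the master's bookkeeping, not an HBase class.
    regions: Dict[str, Region] = field(default_factory=dict)
    group_servers: Dict[str, Set[str]] = field(default_factory=dict)

    def server_added(self, group: str, server: str) -> List[str]:
        """Record a server joining `group` and retry that group's FAILED_OPEN regions."""
        self.group_servers.setdefault(group, set()).add(server)
        kicked = []
        for region in self.regions.values():
            # Regions of other groups are left alone: they cannot be
            # hosted by a server that belongs to this group.
            if region.group == group and region.state == FAILED_OPEN:
                region.state = OPEN   # the toy model assumes the retry succeeds
                region.server = server
                kicked.append(region.name)
        return sorted(kicked)
```

In a real master the retry could fail again and would go through the normal assignment path; the sketch only shows the group-scoped trigger.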

We could make this a general facility in the master. Without RSgroups, regions would most
likely be in FAILED_OPEN because of corruption or a persistent runtime problem, such as
failure to load a compression codec. Retrying assignments neither helps nor hurts in those
cases, really. On the other hand, if a transient condition resulted in regions in FAILED_OPEN
state (I've seen that with Phoenix), then it would help to do this generally, not only when
servers join the cluster, but periodically as well. Or, as discussed on other issues, we've
contemplated adding this as a feature to hbck.

> [RSgroups] Regions assigned to a RSGroup all go to FAILED_OPEN state when all servers in the group are unavailable
> ------------------------------------------------------------------------------------------------------------------
>                 Key: HBASE-19144
>                 URL: https://issues.apache.org/jira/browse/HBASE-19144
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Andrew Purtell
>            Assignee: Andrew Purtell
>            Priority: Major
>             Fix For: 2.0.0, 3.0.0, 1.4.0, 1.5.0
>         Attachments: HBASE-19144-branch-1.patch
>
> After all servers in the RSgroup are down the regions cannot be opened anywhere and transition
> rapidly into FAILED_OPEN state.
> 2017-10-31 21:06:25,449 INFO [ProcedureExecutor-13] master.RegionStates: Transition {c6c8150c9f4b8df25ba32073f25a5143 state=OFFLINE, ts=1509483985448, server=node-5.cluster,16020,1509482700768} to {c6c8150c9f4b8df25ba32073f25a5143 state=FAILED_OPEN, ts=1509483985449, server=node-5.cluster,16020,1509482700768}
> 2017-10-31 21:06:25,449 WARN [ProcedureExecutor-13] master.RegionStates: Failed to open/close d4e2f173e31ffad6aac140f4bd7b02bc on node-5.cluster,16020,1509482700768, set to FAILED_OPEN
> Any region in FAILED_OPEN state has to be manually reassigned, or the master can be restarted
> and this will also cause reattempt of assignment of any regions in FAILED_OPEN state. This
> is not unexpected but is an operational headache. It would be better if the RSGroupInfoManager
> could automatically kick reassignments of regions in FAILED_OPEN state when servers rejoin
> the cluster.
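
For operators who want to see which regions are affected before reassigning them by hand, the following Python sketch scans master log lines of the shape quoted in the issue description and collects the encoded region names that transitioned into FAILED_OPEN. The function name `failed_open_regions` is made up for this example, and the pattern is an assumption based only on the INFO line quoted in the description; other HBase versions may log transitions differently.

```python
import re
from typing import Iterable, List

# Matches lines such as:
#   ... master.RegionStates: Transition {<region> state=OFFLINE, ...} to {<region> state=FAILED_OPEN, ...}
# The pattern is inferred from the single log line quoted in the issue.
TRANSITION = re.compile(
    r"Transition \{(?P<region>[0-9a-f]{32})\s+state=(?P<old>\w+),.*?\} to "
    r"\{(?P=region)\s+state=(?P<new>\w+),"
)


def failed_open_regions(lines: Iterable[str]) -> List[str]:
    """Return encoded region names whose logged transition ended in FAILED_OPEN."""
    found = []
    for line in lines:
        match = TRANSITION.search(line)
        if match and match.group("new") == "FAILED_OPEN":
            region = match.group("region")
            if region not in found:   # keep first-seen order, drop repeats
                found.append(region)
    return found
```

The WARN line in the description ("Failed to open/close ... set to FAILED_OPEN") has a different shape and is deliberately not matched here.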

This message was sent by Atlassian JIRA