hbase-issues mailing list archives

From "Andrew Purtell (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-19144) [RSgroups] Retry assignments in FAILED_OPEN state when servers (re)join the cluster
Date Thu, 02 Nov 2017 18:39:00 GMT

    [ https://issues.apache.org/jira/browse/HBASE-19144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236355#comment-16236355 ]

Andrew Purtell commented on HBASE-19144:

bq. while (!hasChanged) ?

Good point. I suppose this should be changed everywhere. The Javadoc of Object#wait says
spurious wakeups are possible, so the wait belongs in a loop that rechecks the condition.
Let me make this change.

This code also punts on interrupt handling. On interrupt we should fall through and check
the state of master.isAborted and master.isStopped. If 'hasChanged' is still false we can
just go back to waiting. The threads are daemon threads, so they won't block a shutdown.
Will change this too.
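
A minimal sketch of the wait pattern described above, not the actual patch. The class
ChangeWaiter, its methods, and the pollMillis parameter are hypothetical names for
illustration; the shouldStop supplier stands in for the master.isAborted and
master.isStopped checks.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper, not HBase code: waits on a flag guarded by a monitor. */
class ChangeWaiter {
  private final Object lock = new Object();
  private boolean hasChanged = false; // guarded by lock

  /** Called by whoever observes a change, e.g. a server (re)joining. */
  void signalChange() {
    synchronized (lock) {
      hasChanged = true;
      lock.notifyAll();
    }
  }

  /**
   * Blocks until a change is signalled or shouldStop reports true.
   * Returns true if a change was consumed, false if we gave up because of a stop.
   */
  boolean awaitChange(BooleanSupplier shouldStop, long pollMillis) {
    boolean interrupted = false;
    try {
      synchronized (lock) {
        // Loop on the condition: Object#wait may return spuriously.
        while (!hasChanged) {
          if (shouldStop.getAsBoolean()) {
            return false;
          }
          try {
            lock.wait(pollMillis);
          } catch (InterruptedException e) {
            // Fall through: the loop rechecks shouldStop and hasChanged.
            interrupted = true;
          }
        }
        hasChanged = false;
        return true;
      }
    } finally {
      // Restore the interrupt status only once we are done waiting.
      if (interrupted) {
        Thread.currentThread().interrupt();
      }
    }
  }
}
```

The timed wait is a design choice of this sketch: it lets the loop notice a stop request
even if nobody interrupts or notifies the thread.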

> [RSgroups] Retry assignments in FAILED_OPEN state when servers (re)join the cluster
> -----------------------------------------------------------------------------------
>                 Key: HBASE-19144
>                 URL: https://issues.apache.org/jira/browse/HBASE-19144
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Andrew Purtell
>            Assignee: Andrew Purtell
>            Priority: Major
>             Fix For: 2.0.0, 3.0.0, 1.4.0, 1.5.0
>         Attachments: HBASE-19144-branch-1.patch, HBASE-19144.patch
> After all servers in the RSgroup are down the regions cannot be opened anywhere and transition rapidly into FAILED_OPEN state.
> 2017-10-31 21:06:25,449 INFO [ProcedureExecutor-13] master.RegionStates: Transition {c6c8150c9f4b8df25ba32073f25a5143 state=OFFLINE, ts=1509483985448, server=node-5.cluster,16020,1509482700768} to {c6c8150c9f4b8df25ba32073f25a5143 state=FAILED_OPEN, ts=1509483985449, server=node-5.cluster,16020,1509482700768}
> 2017-10-31 21:06:25,449 WARN [ProcedureExecutor-13] master.RegionStates: Failed to open/close d4e2f173e31ffad6aac140f4bd7b02bc on node-5.cluster,16020,1509482700768, set to FAILED_OPEN
> Any region in FAILED_OPEN state has to be manually reassigned, or the master can be restarted and this will also cause reattempt of assignment of any regions in FAILED_OPEN state. This is not unexpected but is an operational headache. It would be better if the RSGroupInfoManager could automatically kick reassignments of regions in FAILED_OPEN state when servers rejoin the cluster.
