hbase-issues mailing list archives

From "Andrew Purtell (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-19144) [RSgroups] Retry assignments in FAILED_OPEN state when servers (re)join the cluster
Date Fri, 03 Nov 2017 19:52:00 GMT

    [ https://issues.apache.org/jira/browse/HBASE-19144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16238274#comment-16238274

Andrew Purtell commented on HBASE-19144:

bq.  Question: Should DEFAULT_REASSIGN_WAIT_INTERVAL be larger than 10s? 

I originally had it at 30 seconds, but that was an arbitrary choice. Since I had to pick an arbitrary
value anyway, I picked one biased toward testing. It can be changed. Do you have a suggestion?
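To make the interval question concrete, here is a minimal sketch of a periodic retry loop. It assumes the master keeps a map of region states and has some hook for requesting a new assignment. `FailedOpenRetrier`, its `State` enum, and the `reassign` callback are hypothetical illustrations, not HBase's RegionStates or AssignmentManager API; only the constant name (with an `_MS` suffix added) and the 10 s default come from this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical sketch of a periodic FAILED_OPEN retry chore; not HBase's actual API.
class FailedOpenRetrier {
    // Modeled on DEFAULT_REASSIGN_WAIT_INTERVAL from the thread (10 s; originally 30 s).
    static final long DEFAULT_REASSIGN_WAIT_INTERVAL_MS = 10_000L;

    enum State { OPEN, OFFLINE, FAILED_OPEN }

    private final Map<String, State> regionStates; // region name -> state (stand-in for master state)
    private final Consumer<String> reassign;       // hypothetical hook that requests a new assignment
    private final long intervalMs;
    private ScheduledExecutorService scheduler;

    FailedOpenRetrier(Map<String, State> regionStates, Consumer<String> reassign, long intervalMs) {
        if (intervalMs <= 0) {
            throw new IllegalArgumentException("intervalMs must be positive");
        }
        this.regionStates = regionStates;
        this.reassign = reassign;
        this.intervalMs = intervalMs;
    }

    /** One pass: move every FAILED_OPEN region back to OFFLINE and ask for reassignment. */
    synchronized List<String> runOnce() {
        List<String> retried = new ArrayList<>();
        for (Map.Entry<String, State> e : regionStates.entrySet()) {
            if (e.getValue() == State.FAILED_OPEN) {
                e.setValue(State.OFFLINE); // pending assignment again
                reassign.accept(e.getKey());
                retried.add(e.getKey());
            }
        }
        return retried;
    }

    /** Run runOnce() every intervalMs, measured from the end of the previous pass. */
    synchronized void start() {
        if (scheduler != null) {
            return; // already running
        }
        scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(this::runOnce, intervalMs, intervalMs, TimeUnit.MILLISECONDS);
    }

    /** Stop the background retries. */
    synchronized void stop() {
        if (scheduler != null) {
            scheduler.shutdownNow();
            scheduler = null;
        }
    }
}
```

Using `scheduleWithFixedDelay` rather than `scheduleAtFixedRate` times each pass from the end of the previous one, so a slow pass cannot cause passes to pile up regardless of which interval is chosen.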

bq. Should there be any jitter?

Not necessary, I think, because the master is the only process that will be taking this

> [RSgroups] Retry assignments in FAILED_OPEN state when servers (re)join the cluster
> -----------------------------------------------------------------------------------
>                 Key: HBASE-19144
>                 URL: https://issues.apache.org/jira/browse/HBASE-19144
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Andrew Purtell
>            Assignee: Andrew Purtell
>            Priority: Major
>             Fix For: 2.0.0, 3.0.0, 1.4.0, 1.5.0
>         Attachments: HBASE-19144-branch-1.patch, HBASE-19144-branch-1.patch, HBASE-19144.patch, HBASE-19144.patch, HBASE-19144.patch
> After all servers in the RSgroup are down, the regions cannot be opened anywhere and rapidly
transition into FAILED_OPEN state.
> 2017-10-31 21:06:25,449 INFO [ProcedureExecutor-13] master.RegionStates: Transition {c6c8150c9f4b8df25ba32073f25a5143 state=OFFLINE, ts=1509483985448, server=node-5.cluster,16020,1509482700768} to {c6c8150c9f4b8df25ba32073f25a5143 state=FAILED_OPEN, ts=1509483985449, server=node-5.cluster,16020,1509482700768}
> 2017-10-31 21:06:25,449 WARN [ProcedureExecutor-13] master.RegionStates: Failed to open/close d4e2f173e31ffad6aac140f4bd7b02bc on node-5.cluster,16020,1509482700768, set to FAILED_OPEN
> Any region in FAILED_OPEN state has to be manually reassigned, or the master must be restarted,
which also causes assignment of any regions in FAILED_OPEN state to be reattempted. This
is not unexpected, but it is an operational headache. It would be better if the RSGroupInfoManager
could automatically kick off reassignment of regions in FAILED_OPEN state when servers rejoin
the cluster.
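The proposed behavior can be sketched as a handler invoked when a server (re)joins: find the group that server belongs to, then hand only that group's FAILED_OPEN regions back to assignment. This is a hypothetical model, not RSGroupInfoManager's real interface; `GroupAwareReassigner`, `onServerJoined`, and the `reassign` callback are illustrative names. It also assumes group membership is keyed by host:port, since a restarted server comes back with a new start code (the third field of server names such as node-5.cluster,16020,1509482700768).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical model of join-triggered reassignment; names do not correspond to HBase classes.
class GroupAwareReassigner {
    enum State { OPEN, OFFLINE, FAILED_OPEN }

    private final Map<String, Set<String>> groupServers; // group -> member servers as "host:port"
    private final Map<String, String> tableGroup;        // table -> group
    private final Map<String, String> regionTable;       // region -> table
    private final Map<String, State> regionState;        // region -> current state
    private final Consumer<String> reassign;             // hypothetical assignment hook

    GroupAwareReassigner(Map<String, Set<String>> groupServers, Map<String, String> tableGroup,
                         Map<String, String> regionTable, Map<String, State> regionState,
                         Consumer<String> reassign) {
        this.groupServers = groupServers;
        this.tableGroup = tableGroup;
        this.regionTable = regionTable;
        this.regionState = regionState;
        this.reassign = reassign;
    }

    /** Called when a server (re)joins; returns the regions handed back to assignment. */
    synchronized List<String> onServerJoined(String hostAndPort) {
        // Find which group the joining server belongs to.
        String group = null;
        for (Map.Entry<String, Set<String>> e : groupServers.entrySet()) {
            if (e.getValue().contains(hostAndPort)) {
                group = e.getKey();
                break;
            }
        }
        List<String> kicked = new ArrayList<>();
        if (group == null) {
            return kicked; // server is not a member of any known group
        }
        // Retry only FAILED_OPEN regions whose table is bound to that group.
        for (Map.Entry<String, State> e : regionState.entrySet()) {
            String region = e.getKey();
            if (e.getValue() == State.FAILED_OPEN
                    && group.equals(tableGroup.get(regionTable.get(region)))) {
                e.setValue(State.OFFLINE); // pending assignment again
                reassign.accept(region);
                kicked.add(region);
            }
        }
        return kicked;
    }
}
```

Filtering by the joining server's group keeps the handler from retrying regions whose group still has no live servers, which would likely just send them back to FAILED_OPEN.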

This message was sent by Atlassian JIRA
