hbase-issues mailing list archives

From "Andrew Purtell (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-20087) Periodically attempt redeploy of regions in FAILED_OPEN state
Date Wed, 28 Feb 2018 00:12:00 GMT

    [ https://issues.apache.org/jira/browse/HBASE-20087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379528#comment-16379528 ]

Andrew Purtell commented on HBASE-20087:

bq. I'm trying to think about what are the recent cases in which I've seen FAILED_OPEN's happening
and if automatically trying to assign them would have caused irreparable consequences (e.g.
riding over some data loss). Nothing is coming to mind 

We have kicked FAILED_OPEN assignments a few times, and I've done it more frequently when
testing rsgroups, so much so it became part of the backport. I think it is safe. If the condition
isn't resolved the retries will just continue to fail.  

bq. would be nice to be a little more word-y here, to be more clear that the region was FAILED_OPEN
and we're trying to help.

I changed the log line to 

    LOG.info("Retrying failed assignment for " + s.toDescriptiveString());

The string built by RegionState#toDescriptiveString prints the current state, which will be
FAILED_OPEN, so that should be clear.
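
For illustration only, here is a minimal Java sketch of one retry pass over the regions in transition. It is not the code in the attached patch. The class name, the State enum, RegionStateSketch, and the assign callback are hypothetical stand-ins for the master's real types; only the log message text is taken from the comment above. In a real master this pass would presumably run on a periodic schedule, which the sketch leaves out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch; none of these types are HBase classes.
class FailedOpenRetrySketch {

    // Assumed subset of region states; only FAILED_OPEN matters here.
    enum State { OPENING, OPEN, FAILED_OPEN }

    // Stand-in for a region state entry held by the master.
    static final class RegionStateSketch {
        final String name;
        final State state;

        RegionStateSketch(String name, State state) {
            this.name = name;
            this.state = state;
        }

        // Includes the current state, so a retry log line shows FAILED_OPEN.
        String toDescriptiveString() {
            return "{" + name + " state=" + state + "}";
        }
    }

    // One pass: re-submit assignment for every region in FAILED_OPEN and
    // return the log lines that would be written. If the cause of the
    // failure has not cleared, the assignment simply fails again and the
    // region is picked up by a later pass.
    static List<String> retryFailedOpen(List<RegionStateSketch> regionsInTransition,
                                        Consumer<RegionStateSketch> assign) {
        List<String> logLines = new ArrayList<>();
        for (RegionStateSketch s : regionsInTransition) {
            if (s.state == State.FAILED_OPEN) {
                logLines.add("Retrying failed assignment for " + s.toDescriptiveString());
                assign.accept(s);
            }
        }
        return logLines;
    }

    public static void main(String[] args) {
        List<RegionStateSketch> rit = Arrays.asList(
            new RegionStateSketch("region-a", State.FAILED_OPEN),
            new RegionStateSketch("region-b", State.OPENING));
        // The assign callback is a no-op here; a real one would queue the assignment.
        for (String line : retryFailedOpen(rit, s -> { })) {
            System.out.println(line);
        }
    }
}
```

Regions in any other state are left alone, so a pass like this should be harmless when nothing is stuck.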

> Periodically attempt redeploy of regions in FAILED_OPEN state
> -------------------------------------------------------------
>                 Key: HBASE-20087
>                 URL: https://issues.apache.org/jira/browse/HBASE-20087
>             Project: HBase
>          Issue Type: Improvement
>          Components: master, Region Assignment
>            Reporter: Andrew Purtell
>            Assignee: Andrew Purtell
>            Priority: Major
>             Fix For: 2.0.0, 1.5.0
>         Attachments: 0001-W-4723090-Port-the-RIT-FAILED_OPEN-state-hack-from-R.patch,
HBASE-20087-branch-1.patch, HBASE-20087-branch-1.patch
> Because RSGroups can cause permanent RIT with regions in FAILED_OPEN state, we added
logic to the master portion of the RSGroups extension to enumerate RITs and retry assignment
of regions in FAILED_OPEN state.
> However, this strategy can be applied generally to reduce the need for operator involvement
in cluster operations. Currently an operator has to manually resolve FAILED_OPEN assignments, but
there is little risk in automatically retrying them after a while. If the reason the assignment
failed has not cleared, the assignment will just fail again. Should the reason the assignment
failed be resolved, then operators don't have to do more in order for the cluster to fully
