hbase-issues mailing list archives

From "Andrew Purtell (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-17704) Regions stuck in FAILED_OPEN when HDFS blocks are missing
Date Thu, 02 Mar 2017 20:55:46 GMT

    [ https://issues.apache.org/jira/browse/HBASE-17704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892976#comment-15892976 ]

Andrew Purtell commented on HBASE-17704:

I agree. I didn't know about HBASE-16209. With an exponential backoff policy and a cap on
max wait time (I see that patch has it) there's no reason not to keep retrying indefinitely.
Even before that patch, the old default of 10 attempts was too small to ride over some
transient issues. At some point operator intervention is necessary anyway, but we can get
paged by a region-in-transition-too-long alert to deal with it, and there's no harm in having
the AssignmentManager (AM) retry until we tell it not to with unassign_region or similar.
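
To make the retry policy described above concrete, here is a minimal sketch of capped
exponential backoff with an unbounded retry loop. It is not the HBASE-16209 patch or the
AssignmentManager code; the names backoff_delay_ms, retry_until_open, try_open and
should_stop, and the 1 s initial / 60 s maximum delays, are all assumptions made for
illustration.

```python
import time


def backoff_delay_ms(attempt, initial_ms=1000, max_ms=60000):
    """Delay before retry number `attempt` (0-based): doubles each time, capped at max_ms.

    initial_ms and max_ms are illustrative values, not HBase defaults.
    """
    # Clamp the exponent so a very long-running retry loop never builds huge integers.
    return min(max_ms, initial_ms * (2 ** min(attempt, 32)))


def retry_until_open(try_open, should_stop, sleep=time.sleep):
    """Retry try_open() indefinitely until it succeeds or should_stop() returns True.

    try_open and should_stop are hypothetical callables: the first stands in for a
    region open attempt, the second for an operator telling the retry loop to give up.
    Returns the number of attempts made on success, or None if stopped.
    """
    attempt = 0
    while not should_stop():
        if try_open():
            return attempt + 1
        # Wait longer after each failure, but never longer than the cap.
        sleep(backoff_delay_ms(attempt) / 1000.0)
        attempt += 1
    return None
```

Because the delay is capped, retrying forever costs at most one attempt per max_ms once
the cap is reached, which is why there is no fixed attempt limit in this sketch.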

> Regions stuck in FAILED_OPEN when HDFS blocks are missing
> ---------------------------------------------------------
>                 Key: HBASE-17704
>                 URL: https://issues.apache.org/jira/browse/HBASE-17704
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 1.1.8
>            Reporter: Mathias Herberts
> We recently experienced the loss of a whole rack (6 DNs + RS) in a 120 node cluster.
> This led to the regions hosted on the 6 unavailable RSs being reassigned to live RSs.
> When attempting to open some of the reassigned regions, some RSs encountered missing
> blocks and issued "No live nodes contain current block Block locations", putting the
> regions in state FAILED_OPEN.
> Once the disappeared DNs came back online, the regions were left in FAILED_OPEN, needing
> a restart of all the affected RSs to solve the problem.

