hbase-dev mailing list archives

From "Yu Li (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HBASE-20156) Allow regionserver to live during HDFS failure
Date Thu, 08 Mar 2018 06:49:00 GMT
Yu Li created HBASE-20156:

             Summary: Allow regionserver to live during HDFS failure
                 Key: HBASE-20156
                 URL: https://issues.apache.org/jira/browse/HBASE-20156
             Project: HBase
          Issue Type: New Feature
            Reporter: Yu Li

Currently, if something goes wrong with HDFS, for example NN fencing or the NN entering safe
mode, the RS aborts itself immediately after detecting the failure (for instance when a log
roll or flush fails). On a large-scale cluster with a heavy write workload, there will then
be a huge amount of WAL to split and replay once HDFS is back, and recovery might take tens
of minutes or even hours. (We have experienced this more than once in production; there are
always surprises we never expected, such as an unstable power supply for the NN.)

Here we propose adding an option that lets the RS stay alive during an HDFS failure instead
of aborting. While HDFS is unavailable, the RS will throw exceptions to clients indicating
that it is out of service, and it can recover right after HDFS is back.
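
A minimal sketch of the proposed behaviour, to make the state transitions concrete. It is
not HBase code: the class HdfsFailureGuard, the exception OutOfServiceException, the
surviveHdfsFailure flag and the health probe are all hypothetical names chosen for
illustration, and the real change would have to hook into the existing log roll and flush
failure paths.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch; none of these names are existing HBase APIs. */
class HdfsFailureGuard {

    /** Thrown to clients while the server is out of service (hypothetical name). */
    static class OutOfServiceException extends IOException {
        OutOfServiceException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    private final boolean surviveHdfsFailure;   // the proposed option (assumed name)
    private final BooleanSupplier hdfsHealthy;  // stand-in for a real HDFS health probe
    private final AtomicBoolean outOfService = new AtomicBoolean(false);
    private volatile boolean aborted = false;
    private volatile IOException lastFailure;

    HdfsFailureGuard(boolean surviveHdfsFailure, BooleanSupplier hdfsHealthy) {
        this.surviveHdfsFailure = surviveHdfsFailure;
        this.hdfsHealthy = hdfsHealthy;
    }

    /** Called when a log roll or flush fails because of an HDFS error. */
    void onFsFailure(IOException cause) {
        if (surviveHdfsFailure) {
            // Proposed behaviour: stay alive, but stop serving requests.
            lastFailure = cause;
            outOfService.set(true);
        } else {
            // Current behaviour: abort immediately.
            aborted = true;
        }
    }

    /** Called at the start of each client request. */
    void checkInService() throws OutOfServiceException {
        if (aborted) {
            throw new IllegalStateException("region server aborted");
        }
        if (outOfService.get()) {
            throw new OutOfServiceException(
                "HDFS unavailable, region server is out of service", lastFailure);
        }
    }

    /** Periodic check: resume service once the probe reports HDFS healthy again. */
    void recheck() {
        if (outOfService.get() && hdfsHealthy.getAsBoolean()) {
            outOfService.set(false);
        }
    }

    boolean isOutOfService() { return outOfService.get(); }

    boolean isAborted() { return aborted; }
}
```

With the option enabled, a failed flush only flips the server into the out-of-service
state, so no WAL splitting is triggered; with it disabled, the sketch keeps today's
abort-on-failure behaviour.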

This would also make it possible to restart HDFS in some extreme cases, and would allow us
to survive if anything goes wrong during an HDFS upgrade.

This message was sent by Atlassian JIRA
