hbase-dev mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] Created: (HBASE-611) regionserver should do basic health check before reporting alls-well to the master
Date Fri, 02 May 2008 03:00:55 GMT
regionserver should do basic health check before reporting alls-well to the master
----------------------------------------------------------------------------------

                 Key: HBASE-611
                 URL: https://issues.apache.org/jira/browse/HBASE-611
             Project: Hadoop HBase
          Issue Type: Improvement
    Affects Versions: 0.1.2
            Reporter: stack
            Priority: Minor
             Fix For: 0.2.0


On IRC this afternoon, a user killed a regionserver.  The kill did something in HDFS.  Another
regionserver, the one carrying the catalog tables, then started to get exceptions out of HDFS.
The last thing it logged was:

{code}
[15:55]	<jgray>	2008-05-01 15:49:51,710 FATAL org.apache.hadoop.hbase.HRegionServer: Replay of hlog required. Forcing server restart
[15:55]	<jgray>	org.apache.hadoop.hbase.DroppedSnapshotException: Could not get block locations. Aborting...
{code}

That's fine.

Only it didn't go down.  It was left in a state where it continued to send the master pings as
though nothing was wrong, so its lease never timed out, and the master was hosed because it
couldn't get to the catalog tables.

Regionservers should do a basic check that all is healthy before they ping the master.  If critical
threads have exited, or a flag saying HDFS has been found bad has been set, then the regionserver
should stop reporting to the master so the master can deploy its load elsewhere.
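
A rough sketch of what such a check might look like follows.  It is only an illustration of the
idea above, not a patch: the class HealthGate and its methods are hypothetical names and do not
exist in the HBase code base.  It assumes the regionserver can hand over references to its
critical threads and that whatever code catches a fatal HDFS error can set a flag.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch only; none of these names come from HBase.
class HealthGate {
    // Threads the server cannot function without (assumed to be supplied by the caller).
    private final List<Thread> criticalThreads;
    // Set once the filesystem has been found bad; never cleared in this sketch.
    private final AtomicBoolean filesystemBad = new AtomicBoolean(false);

    HealthGate(List<Thread> criticalThreads) {
        this.criticalThreads = criticalThreads;
    }

    // To be called by the code path that hits a fatal HDFS error.
    void markFilesystemBad() {
        filesystemBad.set(true);
    }

    // Healthy means: no bad-filesystem flag and every critical thread still alive.
    boolean isHealthy() {
        if (filesystemBad.get()) {
            return false;
        }
        for (Thread t : criticalThreads) {
            if (!t.isAlive()) {
                return false;
            }
        }
        return true;
    }

    // Runs the report only when healthy; returns whether the report was sent.
    // Skipping the report lets the lease on the master side time out.
    boolean reportIfHealthy(Runnable sendReport) {
        if (!isHealthy()) {
            return false;
        }
        sendReport.run();
        return true;
    }
}
```

The reporting loop would call reportIfHealthy(...) in place of an unconditional ping.  Once the
gate reports unhealthy, the pings stop, the lease expires, and the master is free to reassign
the regions.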

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

