hbase-user mailing list archives

From Dan Harvey <dan.har...@mendeley.com>
Subject Rolling out Hadoop/HBase updates
Date Tue, 29 Jun 2010 13:43:26 GMT
Hey,

I've been thinking about how we do our configuration and code updates for
Hadoop and HBase, and was wondering what others do and what the best
practice is to avoid errors with HBase.

Currently we do a rolling update where we restart the services on one node
at a time: shutting down the region server, then restarting the datanode
and task tracker, depending on what we are updating and what has changed. But
with this I have occasionally found errors with the HBase cluster afterwards,
due to a corrupt META table, which I think could have been caused by restarting
the datanode, or maybe by not waiting long enough for the cluster to sort out
losing a region server before moving on to the next node.
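
For reference, this is roughly what our per-node restart looks like (a
minimal sketch in Python; the host names, install paths and sleep times
below are placeholders rather than our real values) :-

#!/usr/bin/env python
# Rough rolling-restart sketch: stop the region server first so the master
# can reassign its regions, then bounce the Hadoop daemons on the node, then
# bring the region server back and pause before touching the next node.
# NODES, the install paths and SETTLE_SECS are placeholder assumptions.

import subprocess
import time

NODES = ["node1", "node2", "node3", "node4", "node5"]   # placeholder hosts
HBASE = "/opt/hbase/bin/hbase-daemon.sh"                # assumed install path
HADOOP = "/opt/hadoop/bin/hadoop-daemon.sh"             # assumed install path
SETTLE_SECS = 300   # guess at how long to let the master reassign regions

def run(host, cmd):
    # Run a daemon command on a remote node over ssh.
    subprocess.check_call(["ssh", host, cmd])

for host in NODES:
    # 1. Take the region server down first, so HBase sees a clean shutdown
    #    and can reassign its regions before HDFS on this node goes away.
    run(host, "%s stop regionserver" % HBASE)
    time.sleep(SETTLE_SECS)

    # 2. Now bounce the Hadoop daemons on this node.
    run(host, "%s stop tasktracker" % HADOOP)
    run(host, "%s stop datanode" % HADOOP)
    run(host, "%s start datanode" % HADOOP)
    run(host, "%s start tasktracker" % HADOOP)

    # 3. Bring the region server back and let the cluster settle before
    #    moving on to the next node.
    run(host, "%s start regionserver" % HBASE)
    time.sleep(SETTLE_SECS)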

The most recent error upon restarting a node was :-

2010-06-29 10:46:44,970 ERROR
org.apache.hadoop.hbase.regionserver.HRegionServer: Error closing
files,3822b1ea8ae015f3ec932cafaa282dd211d768ad,1275145898366
java.io.IOException: Filesystem closed
        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:230)

2010-06-29 10:46:44,970 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer: Shutting down
HRegionServer: file system not available
java.io.IOException: File system is not available
        at
org.apache.hadoop.hbase.util.FSUtils.checkFileSystemAvailable(FSUtils.java:129)


Followed by this for every region being served :-

2010-06-29 10:46:44,996 ERROR
org.apache.hadoop.hbase.regionserver.HRegionServer: Error closing
documents,082595c0-6d01-11df-936c-0026b95e484c,1275676410202
java.io.IOException: Filesystem closed
        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:230)


After updating all the nodes, all the region servers shut down after a
few minutes, reporting the following :-

2010-06-29 11:21:59,508 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_-1437671530216085093_2565663 bad datanode[0]
10.0.11.4:50010

2010-06-29 11:22:09,481 FATAL org.apache.hadoop.hbase.regionserver.HLog:
Could not append. Requesting close of hlog
java.io.IOException: All datanodes 10.0.11.4:50010 are bad. Aborting...
        at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2542)


2010-06-29 11:22:09,482 FATAL
org.apache.hadoop.hbase.regionserver.LogRoller: Log rolling failed with
ioe:
java.io.IOException: All datanodes 10.0.11.4:50010 are bad. Aborting...
        at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2542)

2010-06-29 11:22:10,344 ERROR
org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to close log in
abort
java.io.IOException: All datanodes 10.0.11.4:50010 are bad. Aborting...
        at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2542)


This was fixed by restarting the master and starting the region servers
again, but it would be nice to know how to roll out changes more cleanly.
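
For completeness, the recovery was roughly the following (again just a
sketch; the master host name and install paths are placeholders) :-

# Recovery sketch matching the step above: bounce the master, then start
# the region servers across the cluster. Hosts/paths are assumptions.
import subprocess

HBASE_BIN = "/opt/hbase/bin"   # assumed install path

# Bounce the master on the master node...
subprocess.check_call(["ssh", "master-node",
                       "%s/hbase-daemon.sh stop master" % HBASE_BIN])
subprocess.check_call(["ssh", "master-node",
                       "%s/hbase-daemon.sh start master" % HBASE_BIN])

# ...then start the region servers on every host listed in conf/regionservers.
subprocess.check_call(["%s/hbase-daemons.sh" % HBASE_BIN, "start", "regionserver"])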

How do other people here roll out updates to HBase / Hadoop? What order do
you restart services in and how long do you wait before moving to the next
node?

Just so you know, we currently have 5 nodes and will be adding another 10
soon.

Thanks,

-- 
Dan Harvey | Datamining Engineer
www.mendeley.com/profiles/dan-harvey

Mendeley Limited | London, UK | www.mendeley.com
Registered in England and Wales | Company Number 6419015
