From stephen mulcahy <stephen.mulc...@deri.org>
Subject HBase backups
Date Tue, 18 Aug 2009 10:25:47 GMT

I'm a relative newcomer to both HBase and Hadoop so please bear with me 
if some of my queries don't make sense.

I'm managing a small HBase cluster (1 dedicated master, 4 regionservers) 
and am currently attempting to take a backup of the data (we can 
regenerate the data in our HBase but it will take time). I've tried a 
number of different approaches (details below) - I'm wondering if I've 
missed an approach or whether the approach I'm using is the best. All 
comments welcome.

I'm using HBase 0.19.3 running on top of Hadoop 0.19.1 and our HBase 
contains a single table with about 50 million rows.

1. Initially, I came across 
http://issues.apache.org/jira/browse/HBASE-897 which seemed like the 
ideal way for us to back up our HBase installation while allowing it to 
continue running. I ran into a number of problems with this, which I 
suspect are due to my HBase cluster being underpowered. I first ran into 
OutOfMemory exceptions. After bumping the JVM max heap size on the 
client to 512MB, I saw some java.lang.NullPointerException errors during 
the map phase. I'm not sure if these are due to resource issues on the 
HBase cluster or some underlying corruption in HBase.

After adding the following to HBase

export HBASE_OPTS="-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode \
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
-Xloggc:/home/hadoop/hbase/logs/gc-hbase.log"

and setting



in the Hadoop config on the system submitting the backup job, it seemed 
to progress further, but ultimately died with various failures including 
the following,

java.io.IOException: All datanodes are bad. Aborting... at 


which again suggests to me that maybe our cluster isn't beefy enough to 
run HBase and the M/R job required to do the backup.

2. Given the lack of success with the M/R backup, I figured I'd 
shut down HBase and try a copyToLocal of the entire /hbase tree.
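
For anyone wanting to reproduce this step, the copy can be sketched roughly as below. The local destination directory and the DRY_RUN switch are assumptions added for illustration; only the /hbase path comes from this cluster. HBase should already be stopped so the files are not changing underneath the copy.

```shell
#!/bin/sh
# Sketch only: LOCAL_DEST is a hypothetical directory, adjust as needed.
HBASE_ROOT="${HBASE_ROOT:-/hbase}"
LOCAL_DEST="${LOCAL_DEST:-/backup/hbase-copy}"

# Print the command by default; set DRY_RUN=0 to actually execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Copy an HDFS directory tree to the local filesystem.
copy_hbase_tree() {
    run hadoop fs -copyToLocal "$1" "$2"
}

copy_hbase_tree "$HBASE_ROOT" "$LOCAL_DEST"
```

With the defaults this only prints the command it would run, which makes it easy to check the paths before starting a long copy.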

This failed after a few minutes with the following error,

09/08/17 17:53:07 INFO hdfs.DFSClient: No node available for block:
09/08/17 17:53:07 INFO hdfs.DFSClient: Could not obtain block
blk_7870832778982080356_55873 from any node: java.io.IOException: No
live nodes contain current block

(and a bunch of other errors - all the same). This suggests to me that 
there is some issue with our HBase and that some corruption has occurred. 
Looking in JIRA, there seem to be a few instances where this can occur 
in 0.19.3 / 0.19.1. I tried running HDFS fsck - but it reports the 
entire filesystem as healthy. Is there anything I can run to force HBase 
to verify its integrity and drop any rows affected by the above problem?
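
As a possible next diagnostic step, fsck can be asked for per-file block details rather than just the summary. The sketch below assumes the HBase root is /hbase; the DRY_RUN switch is a hypothetical convenience, and the exact report layout may differ between Hadoop versions.

```shell
#!/bin/sh
# Sketch only: HBASE_ROOT is an assumption, adjust for your cluster.
HBASE_ROOT="${HBASE_ROOT:-/hbase}"

# Print the command by default; set DRY_RUN=0 to actually execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# List every file under the given path with its blocks and the datanodes
# that the namenode believes hold each block.
fsck_with_blocks() {
    run hadoop fsck "$1" -files -blocks -locations
}

fsck_with_blocks "$HBASE_ROOT"

# With DRY_RUN=0 the report could be filtered for the failing block to see
# whether the namenode still lists it, and on which datanodes:
#   DRY_RUN=0 fsck_with_blocks /hbase | grep blk_7870832778982080356
```

If the namenode still lists locations for a block that clients cannot read, that would at least explain why the summary reports the filesystem as healthy.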

3. Having failed with the copyToLocal, plan C was to try a distcp to 
another cluster. Initial efforts with distcp failed with errors about 
bad blocks again. I tried running distcp with the -i option (to ignore 
errors) and the copy completed. I've configured HBase on the copy 
destination to use the copied hbase tree and it seems to start ok. I'm 
currently running a count against the copied HBase table to see how 
different it is from the original. Does it seem likely that my copy is 
corrupt or will HBase handle the missing blocks gracefully? How do other 
people verify the integrity of their HBase? Are there tools like fsck 
which can be run at the HBase level?
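
The distcp step can be sketched roughly as follows. The namenode host names, port, and table name are placeholders rather than the real ones, and the DRY_RUN switch is a hypothetical convenience. Since -i tells distcp to ignore failures, files it could not read may simply be absent from the destination, so a matching row count is an indication rather than proof that the copy is complete.

```shell
#!/bin/sh
# Sketch only: both URIs are hypothetical placeholders.
SRC_URI="${SRC_URI:-hdfs://source-namenode:9000/hbase}"
DEST_URI="${DEST_URI:-hdfs://backup-namenode:9000/hbase}"

# Print the command by default; set DRY_RUN=0 to actually execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Copy between clusters, continuing past files that fail to copy (-i).
copy_ignoring_failures() {
    run hadoop distcp -i "$1" "$2"
}

copy_ignoring_failures "$SRC_URI" "$DEST_URI"

# Afterwards, row counts can be compared by running the same command in the
# HBase shell on both clusters ('mytable' is a placeholder table name):
#   count 'mytable'
```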

Any comments on my approach to backups are welcome. As I say, I'm far 
from the top of this particular learning curve!



Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
http://di2.deri.ie    http://webstar.deri.ie    http://sindice.com

