hbase-dev mailing list archives

From Andrew Purtell <apurt...@apache.org>
Subject HBASE-1084
Date Fri, 26 Dec 2008 18:29:57 GMT
Ran another experiment. Cluster started with 16 regions, grew to ~700. Then the HRS serving
META went down. Eventually the cluster "recovered" ... with 20 regions. What happened to the
other ~680? Gone, from META at least. The mapreduce tasks started again and were happy to
process the only regions remaining. It was stunning. Of course with that level of data loss,
the results were no longer meaningful. I had to do a panic reinitialization, so now a new
experiment is running. I didn't have time to look over the logs, but my conjecture is that
there was a file-level problem during a compaction of META. If it happens again this way,
I will look deeper.

I did try to restart the cluster in an attempt to recover. When shutting down, many regionservers
threw DFS exceptions of the "null datanode[0]" variety. The master was even unable to split
log files, due to the same type of error. Meanwhile a DFS file writer external to HBase
was happily creating files and writing blocks with no apparent trouble. As far as I can tell,
the difference was that the external writer was short-lived and recently started.

I am running HBase 0.19 on Hadoop 0.18. Maybe that makes a difference, and DFS fixes
between 0.18 and 0.19 can improve reliability. However, I also think my cluster is a laboratory
for demonstrating why HBASE-1084 -- and the reliability improvements in any code that interacts
with the FS that are a part of it -- is needed.

So I think the continuous writers scenario has found a new victim -- first it was heap, now
it is DFS. I seem to be able to get up to ~700 regions (from 16) over maybe 8 to 24 hours
before DFS starts taking down HRS. Sometimes recovery is fine, but sometimes, as above, the
result is a disaster. Eventually, somewhere above 1000 regions -- last time it was at about
1400 -- unrecoverable file corruption on at least one region becomes inevitable; the
probability goes to 1.0.

   - Andy

