hbase-dev mailing list archives

From stack <st...@duboce.net>
Subject Re: HBASE-1084
Date Fri, 26 Dec 2008 18:58:17 GMT
Andrew Purtell wrote:
> Ran another experiment. Cluster started with 16 regions, grew to ~700. Then the HRS serving
> META went down. Eventually the cluster "recovered" ... with 20 regions. What happened to the
> other ~680? Gone, from META at least.

Were they still present on the filesystem Andrew?

> The mapreduce tasks started again and were happy to process the only regions remaining.
> It was stunning. Of course with that level of data loss, the results were no longer meaningful.
> I had to do a panic reinitialization so now a new experiment is running. I didn't have time
> to look over the logs but my conjecture is there was a file level problem during a compaction
> of META. If it happens again this way next time I will look deeper.

OK.  Let me help out.  Let's check datanode logs too for OOMEs or for 
"xceiverCount X exceeds the limit of concurrent xcievers Y" or for any 
other complaint that would give us a clue as to why it is fragile at 
1000+ nodes.
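
A rough sketch of that log check, in Python. The two patterns are 
assumptions built from the messages named above (the exact log wording 
may differ between Hadoop versions), and the function names are made up 
for illustration:

```python
import re
from collections import Counter

# Assumed patterns, taken from the two complaints mentioned above.
# Adjust them if your datanode logs word these messages differently.
PATTERNS = {
    "oome": re.compile(r"java\.lang\.OutOfMemoryError"),
    "xceiver_limit": re.compile(
        r"xceiverCount \d+ exceeds the limit of concurrent xcievers \d+"
    ),
}


def scan_lines(lines):
    """Count how many lines match each pattern of interest."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def scan_file(path):
    """Scan one log file, tolerating any undecodable bytes."""
    with open(path, errors="replace") as handle:
        return scan_lines(handle)
```

Calling scan_file() on each datanode log and comparing the counts across 
nodes should show whether one of these complaints lines up with the time 
the HRS went down.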

Losing that many edits to .META. "shouldn't" happen; we should be 
flushing the commit log so that even if a fat memcache flush fails, 
we'll have the commit log to replay.  It may have fallen into other 
'holes' such as the one where master will not split logs if shutdown.

> I did try to restart the cluster in an attempt to recover. When shutting down, many regionservers
> threw DFS exceptions of the "null datanode[0]" variety. The master was unable to split log
> files due to the same type of errors, even. Meanwhile a DFS file writer external to HBase
> was happily creating files and writing blocks with no apparent trouble. As far as I can tell
> the difference was it was short lived and recently started.

> I am running HBase 0.19 on Hadoop 0.18. Maybe that makes a difference, and DFS fixes
> or whatever between 0.18 and 0.19 can improve reliability.


> However also I think my cluster is a laboratory for determining why HBASE-1084 -- and
> the reliability improvements in any code that interacts with the FS that are a part of it
> -- is needed.
> So I think the continuous writers scenario has found a new victim -- first it was heap,
> now it is DFS. I seem to be able to get up to ~700 regions (from 16) over maybe 8 to 24 hours
> before DFS starts taking down HRS. Sometimes recovery is fine, but sometimes as above the
> results are disaster. Eventually, somewhere above 1000 regions -- last time it was at about
> 1400 -- unrecoverable file corruption is inevitable on at least one region, probability goes
> to 1.0.

Let me take a look at 1084.

