hadoop-common-user mailing list archives

From Robert J Berger <rber...@runa.com>
Subject Any way to recover CORRUPT/MISSING blocks? (was: HELP NEEDED: What to do after crash and fsck says that .2% Blocks missing. Namenode in safemode)
Date Sat, 17 Sep 2011 08:03:23 GMT
Just want to follow up, first to thank QwertyM aka Harsh Chouraria for helping me out on the
IRC channel. Well beyond the call of duty! It's people like Harsh that make the HBase/Hadoop
community what it is and one of the joys of working with this technology. And then one follow-up
question on how to recover from CORRUPT blocks.

The main thing I learned (other than to be careful not to install packages on all the
regionservers/slaves at once, which can cause Out of Memory errors and crash all your Java
processes) is this:

If your namenode is stuck in safe mode, even though the namenode log says "Safe mode
will be turned off automatically," and there is enough wrong with your HDFS filesystem
(such as too many under-replicated blocks), it seems the namenode has to be taken out of
safe mode manually before it can correct the problem...

I had hallucinated that the datanodes, by running their block verifications, were doing the
work to get the namenode out of safe mode. I probably would have waited another few hours if
Harsh hadn't helped me out and told me what probably everyone but me already knew:

hadoop dfsadmin -safemode leave
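For anyone who lands on this thread later, the safe-mode controls live under dfsadmin; a minimal check-then-leave sequence looks like the sketch below. The hadoop shell function at the top is a stub so the sketch runs without a cluster; drop it to use the real CLI.

```shell
# Stub standing in for the real CLI so this sketch is self-contained;
# remove it on a real cluster.
hadoop() { echo "(would run) hadoop $*"; }

# Check first: safe mode may clear on its own once the reported-block
# ratio reaches the configured threshold.
hadoop dfsadmin -safemode get

# If the ratio is stuck (e.g. blocks are genuinely missing), force the
# namenode out so it can start fixing under-replication:
hadoop dfsadmin -safemode leave
```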


CURRENT QUESTION ON CORRUPT BLOCKS:
------------------------------------------------------------------

After that the namenode did get all the under-replicated blocks replicated, but I ended up
with about 200 blocks that fsck considered CORRUPT and/or MISSING. It looked like tables were
being compacted when the outage occurred; otherwise I don't know why so many of the bad blocks
are in old tables rather than in data being written at the time of the crash. The HDFS
filesystem dates also showed them as being old.

I am not sure what the best thing to do now is, to recover the CORRUPT/MISSING blocks
and to get fsck to say all is healthy.

Is the best thing just to run:

hadoop fsck / -move

which will move what is left of the corrupt blocks into /lost+found on HDFS?
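For context, fsck offers a few dispositions for files with bad blocks; the flags below are the standard ones (again stubbed with a hadoop shell function so the sketch runs standalone; remove the stub on a real cluster):

```shell
# Stub for illustration only; remove on a real cluster.
hadoop() { echo "(would run) hadoop $*"; }

# Report which files own the bad blocks before doing anything destructive:
hadoop fsck / -files -blocks -locations

# Move the salvageable remains of corrupt files into /lost+found on HDFS:
hadoop fsck / -move

# Or, once the files are confirmed restorable from backup, remove them:
hadoop fsck / -delete
```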

Is there any way to recover those blocks? 

I may be able to restore them from the backup/export of all our tables we did recently, and I
believe I can regenerate the rest. But it would still be nice to know whether the blocks
themselves can be recovered, if it ever came to that.
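One way to turn the fsck report into a restore list: the per-file lines for damaged files contain CORRUPT or MISSING, so a small pipeline can extract the paths to re-import from the table export. The fsck output below is a made-up sample (paths invented for illustration); the grep/cut pipeline is the reusable part.

```shell
# Sample of the per-file lines "hadoop fsck / -files" prints for damaged
# files (these paths are invented for illustration):
fsck_output='/hbase/users/123/data/abc: CORRUPT block blk_42
/hbase/orders/456/data/def: MISSING 2 blocks of total size 134217728 B
/hbase/ok/789/data/ghi: OK'

# Keep only the paths of files with corrupt or missing blocks; these are
# the files to restore from the export:
printf '%s\n' "$fsck_output" | grep -E 'CORRUPT|MISSING' | cut -d: -f1
```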

Thanks in advance.
Rob
 
On Sep 16, 2011, at 12:50 AM, Robert J Berger wrote:

> Just had an HDFS/HBase instance where all the slave/regionserver processes crashed,
but the namenode stayed up. I did a proper shutdown of the namenode.
> 
> After bringing Hadoop back up, the namenode is stuck in safe mode. Fsck shows 235 corrupt/missing
blocks out of 117280 blocks. All the slaves are logging "DataBlockScanner: Verification succeeded."
As far as I can tell there are no errors on the datanodes.
> 
> Can I expect it to self-heal? Or do I need to do something to help it along? Any way to
tell how long it will take to recover if I do just have to wait?
> 
> Other than the verification messages on the datanodes, the namenode fsck numbers are
not changing and the namenode log continues to say:
> 
> The ratio of reported blocks 0.9980 has not reached the threshold 0.9990. Safe mode will
be turned off automatically.
> 
> The ratio has not changed for over an hour now.
> 
> If you happen to know the answer, please get back to me right away by email or on #hadoop
IRC as I'm trying to figure it out now...
> 
> Thanks!
> __________________
> Robert J Berger - CTO
> Runa Inc.
> +1 408-838-8896
> http://blog.ibd.com
> 
> 
> 

__________________
Robert J Berger - CTO
Runa Inc.
+1 408-838-8896
http://blog.ibd.com



