hadoop-user mailing list archives

From Tadas Makčinskas <tadas.makcins...@bdc.lt>
Subject Re: Any way to recover CORRUPT/MISSING blocks? (was: HELP NEEDED: What to do after crash and fsck says that .2% Blocks missing. Namenode in safemode)
Date Thu, 20 Dec 2012 13:45:26 GMT
Robert J Berger <rberger@...> writes:

> 
> Just want to follow up to first thank QwertyM aka Harsh Chouraria for helping
> me out on the IRC channel. Well beyond the call of duty! It's people like Harsh
> that make the HBase/Hadoop community what it is and one of the joys of working
> with this technology. And then one follow-on question on how to recover from
> CORRUPT blocks.
> 
> The main thing I learnt, other than being careful not to install packages on
> all the regionservers/slaves at one time (which may cause Out of Memory errors
> and crash all your Java processes), is this:
> 
> If your namenode is stuck in safe mode, even though the namenode log says
> "Safe mode will be turned off automatically", and there is enough wrong with
> your HDFS system (like too many under-replicated blocks), it seems that it has
> to be taken out of safe mode manually before it can correct the problem...
> 
> I hallucinated that the datanodes, by doing verifications, were doing the work
> to get the namenode out of safe mode, and I probably would have waited another
> few hours if Harsh hadn't helped me out and told me what probably everyone but
> me knew:
> 
> hadoop dfsadmin -safemode leave
> 
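For anyone who hits this thread later: the related dfsadmin options, as far as I
understand them, are

    hadoop dfsadmin -safemode get      (report whether the namenode is in safe mode)
    hadoop dfsadmin -safemode wait     (block until safe mode is switched off)
    hadoop dfsadmin -safemode leave    (force the namenode out of safe mode)

though I have only checked this against our own release, so treat it as a sketch
rather than gospel.
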
> CURRENT QUESTION ON CORRUPT BLOCKS:
> ------------------------------------------------------------------
> 
> After that, the namenode did get all the under-replicated blocks replicated,
> but I ended up with about 200 blocks that fsck considered CORRUPT and/or
> MISSING. It looked like tables were being compacted when the outage occurred.
> Otherwise I don't know why a lot of the bad blocks are in old tables, not data
> being written at the time of the crash. The HDFS filesystem dates also showed
> them as being old.
> 
> I am not sure what the best thing to do now is, to recover the CORRUPT/MISSING
> blocks and to get fsck to say all is healthy.
> 
> Is the best thing to just do:
> 
> hadoop fsck -move
> 
> which will move what is left of the corrupt blocks into hdfs /lost+found?
> 
> Is there any way to recover those blocks? 
> 
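Before deciding between -move and -delete, it seems sensible to first see which
files the bad blocks actually belong to. The rough sequence I have been using
(assuming your fsck build has the -list-corruptfileblocks option; older builds
may not) is

    hadoop fsck / -list-corruptfileblocks            (list files with corrupt/missing blocks)
    hadoop fsck / -files -blocks | grep -i CORRUPT   (fallback on older releases)
    hadoop fsck / -move      (salvage what is left of those files into /lost+found)
    hadoop fsck / -delete    (or drop the affected files entirely)

but I cannot vouch for every version behaving the same way.
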
> I may be able to get them from the backup/export of all our tables we did
> recently, and I believe I can regenerate the rest. But it would be nice to know
> whether there is a way to recover them directly, in case there were no other
> option.
> 
> Thanks in advance.
> Rob
> 
> On Sep 16, 2011, at 12:50 AM, Robert J Berger wrote:
> 
> > Just had an HDFS/HBase instance where all the slave/regionserver processes
> > crashed, but the namenode stayed up. I did a proper shutdown of the namenode.
> > 
> > After bringing Hadoop back up, the namenode is stuck in safe mode. Fsck shows
> > 235 corrupt/missing blocks out of 117280 blocks. All the slaves are logging
> > "DataBlockScanner: Verification succeeded". As far as I can tell there are no
> > errors in the datanodes.
> > 
> > Can I expect it to self-heal? Or do I need to do something to help it along?
> > Any way to tell how long it will take to recover if I do have to just wait?
> > 
> > Other than the verification messages on the datanodes, the namenode fsck
> > numbers are not changing and the namenode log continues to say:
> > 
> > The ratio of reported blocks 0.9980 has not reached the threshold 0.9990.
> > Safe mode will be turned off automatically.
> > 
> > The ratio has not changed for over an hour now.
> > 
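If I read the docs correctly, that 0.9990 threshold is the
dfs.safemode.threshold.pct property (dfs.namenode.safemode.threshold-pct on
newer releases), which defaults to 0.999 and can be overridden in hdfs-site.xml,
e.g.

    <property>
      <name>dfs.safemode.threshold.pct</name>
      <value>0.999</value>
    </property>

Lowering it would let the namenode leave safe mode with fewer blocks reported,
but forcing safe mode off with dfsadmin and then repairing the blocks, as
described above, looks like the saner route.
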
> > If you happen to know the answer, please get back to me right away by email
> > or on #hadoop IRC, as I'm trying to figure it out now...
> > 
> > Thanks!
> > __________________
> > Robert J Berger - CTO
> > Runa Inc.
> > +1 408-838-8896
> > http://blog.ibd.com
> > 
> > 
> > 
> 
> __________________
> Robert J Berger - CTO
> Runa Inc.
> +1 408-838-8896
> http://blog.ibd.com
> 
> 

We are having an analogous situation here. Some of our servers went away for a
while. As we attached them back to the cluster, it turned out that we now have
multiple missing/corrupt blocks and some mis-replicated blocks.

I still can't figure out how to restore the system to a normal working state. I
can't find a clean way either to remove those corrupted files or to restore
them. All of them are in the following folders:
   /user/<user>/.Trash
   /user/<user>/.staging 

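The closest thing to a plan I have so far, assuming the files under .Trash and
.staging really are disposable (old trash and leftover job staging data, which
I have not yet confirmed), is roughly

    hadoop fsck / -files | grep -i CORRUPT           (check that only .Trash/.staging files are affected)
    hadoop fs -rmr -skipTrash /user/<user>/.Trash    (drop the trash copies for good, if -skipTrash is available)
    hadoop fsck / -delete                            (or -move, to park whatever is left in /lost+found)

but I am not confident this is the right or safest sequence, hence the question
below.
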
What steps would you advise to resolve our issue?

thanks, Tadas

