hadoop-common-user mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: Corrupt HDFS and salvaging data
Date Fri, 09 May 2008 05:54:53 GMT
Lohit,


I ran fsck after I replaced 1 DN (with data on it) with 1 blank DN and started all daemons.
I see the fsck report includes this:
    Missing replicas:              17025 (29.727087 %)

According to your explanation, this means that after I removed 1 DN, I started missing about
30% of the blocks, right?
Wouldn't that mean that 30% of all blocks were *only* on the 1 DN that I removed?  But how
could that be when I have a replication factor of 3?
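
(In case it's useful, here is roughly how I check where the replicas of a given
file actually live; the path is just an example:)

    bin/hadoop fsck /some/path -files -blocks -locations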

If I run bin/hadoop balancer with my old DN back in the cluster (and the new DN removed), I do
get the happy "The cluster is balanced" response.  So wouldn't that mean that everything is
peachy, and that with a replication factor of 3, removing 1 DN should leave only
some portion of blocks under-replicated, but none *completely* missing from HDFS?
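
(For reference, this is what I run; if I'm reading its output right, dfsadmin
-report also shows per-DN capacity and usage:)

    bin/hadoop balancer
    bin/hadoop dfsadmin -report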

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: lohit <lohit_bv@yahoo.com>
> To: core-user@hadoop.apache.org
> Sent: Friday, May 9, 2008 1:33:56 AM
> Subject: Re: Corrupt HDFS and salvaging data
> 
> Hi Otis,
> 
> The namenode has location information for all replicas of a block. When you run 
> fsck, the namenode checks for those replicas. If all replicas of a block are 
> missing, fsck reports the block as missing; otherwise the block is counted as 
> under-replicated. If you specify the -move or -delete option along with fsck, 
> files with such missing blocks are moved to /lost+found or deleted, depending 
> on the option. 
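> 
> For example ('/' here just means checking from the root of the filesystem):
> 
>     bin/hadoop fsck / -move      (moves files with missing blocks to /lost+found)
>     bin/hadoop fsck / -delete    (deletes files with missing blocks)
> 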
> At what point did you run the fsck command? Was it after the datanodes were 
> stopped? When you run namenode -format, it deletes the directories specified in 
> dfs.name.dir. If a directory already exists, it asks for confirmation. 
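> 
> Those directories come from hadoop-site.xml, for example (the path below is 
> illustrative, not necessarily what you have):
> 
>     <property>
>       <name>dfs.name.dir</name>
>       <value>/var/hadoop/dfs/name</value>
>     </property>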
> 
> Thanks,
> Lohit
> 
> ----- Original Message ----
> From: Otis Gospodnetic 
> To: core-user@hadoop.apache.org
> Sent: Thursday, May 8, 2008 9:00:34 PM
> Subject: Re: Corrupt HDFS and salvaging data
> 
> Hi,
> 
> Update:
> It seems fsck reports HDFS as corrupt when a sufficiently large number of block 
> replicas is missing (or something like that).
> fsck reported a corrupt HDFS after I replaced 1 old DN with 1 new DN.  After I 
> restarted Hadoop with the old set of DNs, fsck stopped reporting a corrupt HDFS 
> and started reporting a *healthy* HDFS.
> 
> 
> I'll follow up with a re-balancing question in a separate email.
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> ----- Original Message ----
> > From: Otis Gospodnetic 
> > To: core-user@hadoop.apache.org
> > Sent: Thursday, May 8, 2008 11:35:01 PM
> > Subject: Corrupt HDFS and salvaging data
> > 
> > Hi,
> > 
> > I have a case of a corrupt HDFS (according to bin/hadoop fsck) and I'm trying 
> > not to lose the precious data in it.  I accidentally ran bin/hadoop namenode 
> > -format on a *new DN* that I had just added to the cluster.  Is it possible 
> > for that to corrupt HDFS?  I also had to explicitly kill the DN daemons before 
> > that, because bin/stop-all.sh didn't stop them for some reason (it always did 
> > before).
> > 
> > Is there any way to salvage the data?  I have a 4-node cluster with a 
> > replication factor of 3, though fsck reports lots of under-replicated blocks:
> > 
> >   ********************************
> >   CORRUPT FILES:        3355
> >   MISSING BLOCKS:       3462
> >   MISSING SIZE:         17708821225 B
> >   ********************************
> > Minimally replicated blocks:   28802 (89.269775 %)
> > Over-replicated blocks:        0 (0.0 %)
> > Under-replicated blocks:       17025 (52.76779 %)
> > Mis-replicated blocks:         0 (0.0 %)
> > Default replication factor:    3
> > Average block replication:     1.7750744
> > Missing replicas:              17025 (29.727087 %)
> > Number of data-nodes:          4
> > Number of racks:               1
> > 
> > 
> > The filesystem under path '/' is CORRUPT
> > 
> > 
> > What can one do at this point to save the data?  If I run bin/hadoop fsck 
> > -move or -delete, will I lose some of the data?  Or will I simply end up with 
> > fewer block replicas and thus have to force re-balancing in order to get back 
> > to a "safe" number of replicas?
> > 
> > Thanks,
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

