hadoop-common-user mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: Corrupt HDFS and salvaging data
Date Fri, 09 May 2008 14:16:42 GMT
Hi,

Here are 2 "bin/hadoop fsck / -files -blocks -locations" reports:

1) For the old HDFS cluster, reportedly HEALTHY, but with this inconsistency:

http://www.krumpir.com/fsck-old.txt.zip   ( < 1MB)

Total blocks:                  32264 (avg. block size 11591245 B)
Minimally replicated blocks:   32264 (100.0 %)   <== looks GOOD, matches "Total blocks"
Over-replicated blocks:        0 (0.0 %)
Under-replicated blocks:       0 (0.0 %)
Mis-replicated blocks:         0 (0.0 %)
Default replication factor:    3                 <== should have 3 copies of each block
Average block replication:     2.418051          <== ???  shouldn't this be 3?
Missing replicas:              0 (0.0 %)         <== if the above is 2.41..., how can I have 0 missing replicas?
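For what it's worth, here is a toy sketch of the only way I can make those numbers add up -- assuming (and this is my guess, not something I've verified in the fsck source) that "Missing replicas" is counted against each file's *own* replication factor rather than the cluster default, so files written with a lower per-file factor pull the average below 3 without registering as missing anything. The file counts below are hypothetical, just chosen to total 32264 blocks:

```python
# Toy model of fsck's summary accounting (my assumption, not verified
# against the fsck source): each file carries its own target replication
# factor, and "missing replicas" is counted per file against that target,
# not against the cluster default of 3.

files = [
    # (number of blocks, per-file replication target, live replicas per block)
    (20000, 2, 2),   # hypothetical files written with replication 2 -- target met
    (12264, 3, 3),   # files at the default factor of 3 -- target met
]

total_blocks = sum(n for n, _, _ in files)
total_replicas = sum(n * live for n, _, live in files)
missing = sum(n * (target - live) for n, target, live in files)

avg_replication = total_replicas / total_blocks
print(round(avg_replication, 2))  # ~2.38: below the default of 3
print(missing)                    # 0: every file meets its own target
```

If that assumption is right, then my 2.418051 average with 0 missing replicas just means a chunk of the files were created with a replication factor below 3.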

2) For the cluster with 1 old DN replaced with 1 new DN:

http://www.krumpir.com/fsck-1newDN.txt.zip ( < 800KB)

Minimally replicated blocks:   29917 (92.72564 %)
Over-replicated blocks:        0 (0.0 %)
Under-replicated blocks:       17124 (53.074635 %)
Mis-replicated blocks:         0 (0.0 %)
Default replication factor:    3
Average block replication:     1.8145611
Missing replicas:              17124 (29.249296 %)
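A side observation while staring at the two reports: the "Missing replicas" percentage seems to be computed against the number of replicas actually present (total blocks times average block replication), not against the expected total. That's just my hypothesis from the numbers, but they do line up:

```python
# Sanity-checking the fsck percentage from report 2 above.
# Hypothesis (mine, inferred from the numbers alone): the "Missing replicas"
# percentage is missing_replicas / replicas_actually_present, where
# replicas_actually_present = total_blocks * average_block_replication.

total_blocks = 32264            # from report 1; report 2 agrees (29917 / 0.9272564)
avg_replication = 1.8145611     # "Average block replication" in report 2
missing_replicas = 17124        # "Missing replicas" in report 2

present = total_blocks * avg_replication
pct = 100 * missing_replicas / present
print(round(pct, 4))  # ~29.2493, matching "17124 (29.249296 %)"
```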



Any help would be appreciated.

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: lohit <lohit_bv@yahoo.com>
> To: core-user@hadoop.apache.org
> Sent: Friday, May 9, 2008 2:47:39 AM
> Subject: Re: Corrupt HDFS and salvaging data
> 
> When you say all daemons, do you mean the entire cluster, including the 
> namenode?
> > According to your explanation, this means that after I removed 1 DN I started 
> > missing about 30% of the blocks, right?
> No, you would only miss the replicas. If all of your blocks have a replication 
> factor of 3, then you would miss only the one replica which was on this DN.
> 
> It would be good to see the full report.
> Could you run hadoop fsck / -files -blocks -locations?
> 
> That would give you much more detailed information. 
> 
> 
> ----- Original Message ----
> From: Otis Gospodnetic 
> To: core-user@hadoop.apache.org
> Sent: Thursday, May 8, 2008 10:54:53 PM
> Subject: Re: Corrupt HDFS and salvaging data
> 
> Lohit,
> 
> 
> I ran fsck after I replaced 1 DN (with data on it) with 1 blank DN and started 
> all daemons.
> I see the fsck report does include this:
>     Missing replicas:              17025 (29.727087 %)
> 
> According to your explanation, this means that after I removed 1 DN I started 
> missing about 30% of the blocks, right?
> Wouldn't that mean that 30% of all blocks were *only* on the 1 DN that I 
> removed?  But how could that be when I have replication factor of 3?
> 
> If I run bin/hadoop balancer with my old DN back in the cluster (and new DN 
> removed), I do get the happy "The cluster is balanced" response.  So wouldn't 
> that mean that everything is peachy and that if my replication factor is 3 then 
> when I remove 1 DN, I should have only some portion of blocks under-replicated, 
> but not *completely* missing from HDFS?
> 
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> ----- Original Message ----
> > From: lohit 
> > To: core-user@hadoop.apache.org
> > Sent: Friday, May 9, 2008 1:33:56 AM
> > Subject: Re: Corrupt HDFS and salvaging data
> > 
> > Hi Otis,
> > 
> > Namenode has location information about all replicas of a block. When you run 
> > fsck, namenode checks for those replicas. If all replicas are missing, then 
> > fsck reports the block as missing. Otherwise they are added to under-replicated 
> > blocks. If you specify the -move or -delete option along with fsck, files with 
> > such missing blocks are moved to /lost+found or deleted, depending on the option. 
> > At what point did you run the fsck command, was it after the datanodes were 
> > stopped? When you run namenode -format it would delete the directories specified 
> > in dfs.name.dir. If the directory exists it would ask for confirmation. 
> > 
> > Thanks,
> > Lohit
> > 
> > ----- Original Message ----
> > From: Otis Gospodnetic 
> > To: core-user@hadoop.apache.org
> > Sent: Thursday, May 8, 2008 9:00:34 PM
> > Subject: Re: Corrupt HDFS and salvaging data
> > 
> > Hi,
> > 
> > Update:
> > It seems fsck reports HDFS as corrupt when a significant-enough number of 
> > block replicas is missing (or something like that).
> > fsck reported corrupt HDFS after I replaced 1 old DN with 1 new DN.  After I 
> > restarted Hadoop with the old set of DNs, fsck stopped reporting a corrupt HDFS 
> > and started reporting a *healthy* HDFS.
> > 
> > 
> > I'll follow up with the re-balancing question in a separate email.
> > 
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > 
> > 
> > ----- Original Message ----
> > > From: Otis Gospodnetic 
> > > To: core-user@hadoop.apache.org
> > > Sent: Thursday, May 8, 2008 11:35:01 PM
> > > Subject: Corrupt HDFS and salvaging data
> > > 
> > > Hi,
> > > 
> > > I have a case of a corrupt HDFS (according to bin/hadoop fsck) and I'm trying 
> > > not to lose the precious data in it.  I accidentally ran bin/hadoop namenode 
> > > -format on a *new DN* that I just added to the cluster.  Is it possible for 
> > > that to corrupt HDFS?  I also had to explicitly kill the DN daemons before that, 
> > > because bin/stop-all.sh didn't stop them for some reason (it always did so before).
> > > 
> > > Is there any way to salvage the data?  I have a 4-node cluster with a 
> > > replication factor of 3, though fsck reports lots of under-replicated blocks:
> > > 
> > >   ********************************
> > >   CORRUPT FILES:        3355
> > >   MISSING BLOCKS:       3462
> > >   MISSING SIZE:         17708821225 B
> > >   ********************************
> > > Minimally replicated blocks:   28802 (89.269775 %)
> > > Over-replicated blocks:        0 (0.0 %)
> > > Under-replicated blocks:       17025 (52.76779 %)
> > > Mis-replicated blocks:         0 (0.0 %)
> > > Default replication factor:    3
> > > Average block replication:     1.7750744
> > > Missing replicas:              17025 (29.727087 %)
> > > Number of data-nodes:          4
> > > Number of racks:               1
> > > 
> > > 
> > > The filesystem under path '/' is CORRUPT
> > > 
> > > 
> > > What can one do at this point to save the data?  If I run bin/hadoop fsck 
> > > -move or -delete, will I lose some of the data?  Or will I simply end up with 
> > > fewer block replicas and will thus have to force re-balancing in order to get 
> > > back to a "safe" number of replicas?
> > > 
> > > Thanks,
> > > Otis
> > > --
> > > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

