hadoop-common-user mailing list archives

From lohit <lohit...@yahoo.com>
Subject Re: Corrupt HDFS and salvaging data
Date Fri, 09 May 2008 17:28:17 GMT
Hi Otis,

Thanks for the reports. It looks like you have a lot of blocks with a
replication factor of 1; when the node that held those blocks was stopped,
the namenode started reporting them as missing, since it could not find any
other replica. Here is what I did:

Find all blocks with replication factor 1:
> grep repl=1 ../tmp.1/fsck-old.txt  | awk '{print $2}'  | sort > repl_1
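(For reference, each block line in the -files -blocks -locations output looks
roughly like '0. blk_<id> len=... repl=1 [host:port, ...]', so awk's $2 picks
out the blk_ name; the exact formatting is an assumption about your fsck
version.)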

Find all blocks reported as MISSING:
> grep MISSIN fsck-1newDN.txt  | grep blk_ | awk '{print $2}' | sort > missing_new

Diff the two lists to see whether any missing block is absent from the
repl=1 list:
> diff repl_1 missing_new | grep ">"

As you can see, all missing blocks had a replication factor of 1 (the diff
above comes back empty). This report does not show locations, but you could
re-run fsck with -locations and confirm that all of them were on the same
datanode. That should explain why the cluster is not healthy even after you
added a new datanode: re-replication only happens when another copy exists,
so if these files had had a replication factor of at least 2, the
under-replicated blocks would have been copied onto the new datanode.
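For example, a rough way to see which datanode(s) held the repl=1 blocks
(a sketch, assuming GNU grep and that your fsck -locations output prints the
[host:port, ...] list on the same line as each block):
> grep repl=1 fsck-old.txt | grep -o '\[[^]]*\]' | sort | uniq -c
If a single location dominates the counts, that confirms the stopped
datanode held the only copies.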

You can set the replication factor of a file with the 'hadoop dfs -setrep'
command.
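For example (the paths are placeholders; -w waits until the target
replication is actually reached, and -R applies the change recursively):
> hadoop dfs -setrep -w 3 /path/to/file
> hadoop dfs -setrep -R -w 3 /path/to/dir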

Thanks,
Lohit

----- Original Message ----
From: Otis Gospodnetic <otis_gospodnetic@yahoo.com>
To: core-user@hadoop.apache.org
Sent: Friday, May 9, 2008 7:16:42 AM
Subject: Re: Corrupt HDFS and salvaging data

Hi,

Here are 2 "bin/hadoop fsck / -files -blocks -locations" reports:

1) For the old HDFS cluster, reportedly HEALTHY, but with this inconsistency:

http://www.krumpir.com/fsck-old.txt.zip   ( < 1MB)

Total blocks:  32264 (avg. block size 11591245 B)
Minimally replicated blocks:   32264 (100.0 %)         <== looks GOOD, matches "Total blocks"
Over-replicated blocks:        0 (0.0 %)
Under-replicated blocks:       0 (0.0 %)
Mis-replicated blocks:         0 (0.0 %)
Default replication factor:    3                 <== should have 3 copies of each block
Average block replication:     2.418051          <== ??? shouldn't this be 3?
Missing replicas:              0 (0.0 %)         <== if the above is 2.41..., how can I have 0 missing replicas?
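(Note: fsck counts missing replicas against each file's own target
replication, not the default, so an average of 2.418 with 0 missing replicas
is consistent if some files were written with repl=1, per Lohit's diagnosis
above. For example, 9,388 blocks at repl=1 and 22,876 at repl=3 gives
(22876*3 + 9388*1) / 32264 = 2.418.)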

2) For the cluster with 1 old DN replaced with 1 new DN:

http://www.krumpir.com/fsck-1newDN.txt.zip ( < 800KB)

Minimally replicated blocks:   29917 (92.72564 %)
Over-replicated blocks:        0 (0.0 %)
Under-replicated blocks:       17124 (53.074635 %)
Mis-replicated blocks:         0 (0.0 %)
Default replication factor:    3
Average block replication:     1.8145611
Missing replicas:              17124 (29.249296 %)



Any help would be appreciated.

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: lohit <lohit_bv@yahoo.com>
> To: core-user@hadoop.apache.org
> Sent: Friday, May 9, 2008 2:47:39 AM
> Subject: Re: Corrupt HDFS and salvaging data
> 
> When you say all daemons, do you mean the entire cluster, including the 
> namenode?
> > According to your explanation, this means that after I removed 1 DN I
> > started missing about 30% of the blocks, right?
> No, you would only miss the replica. If all of your blocks have a replication
> factor of 3, then you would miss only the one replica that was on this DN.
> 
> It would be good to see the full report.
> Could you run 'hadoop fsck / -files -blocks -locations'?
> 
> That would give you much more detailed information. 
> 
> 
> ----- Original Message ----
> From: Otis Gospodnetic 
> To: core-user@hadoop.apache.org
> Sent: Thursday, May 8, 2008 10:54:53 PM
> Subject: Re: Corrupt HDFS and salvaging data
> 
> Lohit,
> 
> 
> I ran fsck after I replaced 1 DN (with data on it) with 1 blank DN and started
> all daemons.
> I see the fsck report does include this:
>     Missing replicas:              17025 (29.727087 %)
> 
> According to your explanation, this means that after I removed 1 DN I started 
> missing about 30% of the blocks, right?
> Wouldn't that mean that 30% of all blocks were *only* on the 1 DN that I
> removed?  But how could that be when I have a replication factor of 3?
> 
> If I run bin/hadoop balancer with my old DN back in the cluster (and new DN 
> removed), I do get the happy "The cluster is balanced" response.  So wouldn't 
> that mean that everything is peachy and that if my replication factor is 3 then 
> when I remove 1 DN, I should have only some portion of blocks under-replicated, 
> but not *completely* missing from HDFS?
> 
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> ----- Original Message ----
> > From: lohit 
> > To: core-user@hadoop.apache.org
> > Sent: Friday, May 9, 2008 1:33:56 AM
> > Subject: Re: Corrupt HDFS and salvaging data
> > 
> > Hi Otis,
> > 
> > Namenode has location information about all replicas of a block. When you
> > run fsck, namenode checks for those replicas. If all replicas are missing,
> > then fsck reports the block as missing; otherwise the block is added to the
> > under-replicated count. If you specify the -move or -delete option along
> > with fsck, files with such missing blocks are moved to /lost+found or
> > deleted, depending on the option.
> > At what point did you run the fsck command; was it after the datanodes were
> > stopped? When you run namenode -format, it deletes the directories
> > specified in dfs.name.dir. If a directory exists, it asks for confirmation.
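> > 
> > A hypothetical way to double-check which directories -format would wipe,
> > assuming dfs.name.dir is set in conf/hadoop-site.xml rather than inherited
> > from hadoop-default.xml:
> > grep -A 1 dfs.name.dir conf/hadoop-site.xml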
> > 
> > Thanks,
> > Lohit
> > 
> > ----- Original Message ----
> > From: Otis Gospodnetic 
> > To: core-user@hadoop.apache.org
> > Sent: Thursday, May 8, 2008 9:00:34 PM
> > Subject: Re: Corrupt HDFS and salvaging data
> > 
> > Hi,
> > 
> > Update:
> > It seems fsck reports HDFS as corrupt when a significant enough number of
> > block replicas is missing (or something like that).
> > fsck reported a corrupt HDFS after I replaced 1 old DN with 1 new DN.  After
> > I restarted Hadoop with the old set of DNs, fsck stopped reporting corrupt
> > HDFS and started reporting *healthy* HDFS.
> > 
> > 
> > I'll follow-up with re-balancing question in a separate email.
> > 
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > 
> > 
> > ----- Original Message ----
> > > From: Otis Gospodnetic 
> > > To: core-user@hadoop.apache.org
> > > Sent: Thursday, May 8, 2008 11:35:01 PM
> > > Subject: Corrupt HDFS and salvaging data
> > > 
> > > Hi,
> > > 
> > > I have a case of a corrupt HDFS (according to bin/hadoop fsck) and I'm
> > > trying not to lose the precious data in it.  I accidentally ran bin/hadoop
> > > namenode -format on a *new DN* that I just added to the cluster.  Is it
> > > possible for that to corrupt HDFS?  I also had to explicitly kill the DN
> > > daemons before that, because bin/stop-all.sh didn't stop them for some
> > > reason (it always did so before).
> > > 
> > > Is there any way to salvage the data?  I have a 4-node cluster with a
> > > replication factor of 3, though fsck reports lots of under-replicated
> > > blocks:
> > > 
> > >   ********************************
> > >   CORRUPT FILES:        3355
> > >   MISSING BLOCKS:       3462
> > >   MISSING SIZE:         17708821225 B
> > >   ********************************
> > > Minimally replicated blocks:   28802 (89.269775 %)
> > > Over-replicated blocks:        0 (0.0 %)
> > > Under-replicated blocks:       17025 (52.76779 %)
> > > Mis-replicated blocks:         0 (0.0 %)
> > > Default replication factor:    3
> > > Average block replication:     1.7750744
> > > Missing replicas:              17025 (29.727087 %)
> > > Number of data-nodes:          4
> > > Number of racks:               1
> > > 
> > > 
> > > The filesystem under path '/' is CORRUPT
> > > 
> > > 
> > > What can one do at this point to save the data?  If I run bin/hadoop fsck
> > > -move or -delete, will I lose some of the data?  Or will I simply end up
> > > with fewer block replicas and will thus have to force re-balancing in
> > > order to get back to a "safe" number of replicas?
> > > 
> > > Thanks,
> > > Otis
> > > --
> > > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
