Date: Thu, 8 May 2008 22:33:56 -0700 (PDT)
From: lohit
Subject: Re: Corrupt HDFS and salvaging data
To: core-user@hadoop.apache.org

Hi Otis,

The namenode keeps location information for every replica of a block. When you run fsck, the namenode checks those replicas: if all replicas of a block are missing, fsck reports the block as missing; otherwise the block is counted as under-replicated. If you pass the -move or -delete option to fsck, files with such missing blocks are moved to /lost+found or deleted, depending on the option.

At what point did you run the fsck command -- was it after the datanodes were stopped?

Also note that running namenode -format deletes the directories specified in dfs.name.dir. If a directory already exists, it asks for confirmation first.

Thanks,
Lohit
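For reference, the commands above look like this in practice -- a minimal sketch; the flag names are from the hadoop CLI of this era, and the config check assumes the stock conf/hadoop-site.xml location:

    # Health report only; nothing on HDFS is modified.
    bin/hadoop fsck /

    # Per-file detail: which blocks each file has and which
    # datanodes currently hold them.
    bin/hadoop fsck / -files -blocks -locations

    # Move files that have missing blocks into /lost+found ...
    bin/hadoop fsck / -move

    # ... or delete such files outright. Both act only on files
    # with missing blocks, not on merely under-replicated ones.
    bin/hadoop fsck / -delete

    # dfs.name.dir is what "namenode -format" wipes, so check
    # where it points before ever re-running a format:
    grep -A 1 dfs.name.dir conf/hadoop-site.xml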
----- Original Message ----
From: Otis Gospodnetic
To: core-user@hadoop.apache.org
Sent: Thursday, May 8, 2008 9:00:34 PM
Subject: Re: Corrupt HDFS and salvaging data

Hi,

Update: it seems fsck reports HDFS as corrupt when a significant-enough number of block replicas is missing (or something like that). fsck reported a corrupt HDFS after I replaced 1 old DN with 1 new DN. After I restarted Hadoop with the old set of DNs, fsck stopped reporting a corrupt HDFS and started reporting a *healthy* HDFS. I'll follow up with a re-balancing question in a separate email.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Otis Gospodnetic
> To: core-user@hadoop.apache.org
> Sent: Thursday, May 8, 2008 11:35:01 PM
> Subject: Corrupt HDFS and salvaging data
>
> Hi,
>
> I have a case of a corrupt HDFS (according to bin/hadoop fsck) and I'm
> trying not to lose the precious data in it. I accidentally ran
> bin/hadoop namenode -format on a *new DN* that I had just added to the
> cluster. Is it possible for that to corrupt HDFS? I also had to
> explicitly kill the DN daemons before that, because bin/stop-all.sh
> didn't stop them for some reason (it always did before).
>
> Is there any way to salvage the data? I have a 4-node cluster with a
> replication factor of 3, though fsck reports lots of under-replicated
> blocks:
>
>  ********************************
>    CORRUPT FILES:   3355
>    MISSING BLOCKS:  3462
>    MISSING SIZE:    17708821225 B
>  ********************************
>  Minimally replicated blocks:   28802 (89.269775 %)
>  Over-replicated blocks:        0 (0.0 %)
>  Under-replicated blocks:       17025 (52.76779 %)
>  Mis-replicated blocks:         0 (0.0 %)
>  Default replication factor:    3
>  Average block replication:     1.7750744
>  Missing replicas:              17025 (29.727087 %)
>  Number of data-nodes:          4
>  Number of racks:               1
>
> The filesystem under path '/' is CORRUPT
>
> What can one do at this point to save the data? If I run bin/hadoop
> fsck -move or -delete, will I lose some of the data? Or will I simply
> end up with fewer block replicas and thus have to force re-balancing
> in order to get back to a "safe" number of replicas?
>
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
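As for the -move / -delete question above: both act only on the files fsck flags as corrupt (files with at least one block that has no replicas left), so they give up on exactly those files; merely under-replicated files are left in place, and the namenode re-replicates them on its own once enough datanodes are back. A minimal sketch of verifying recovery, assuming the 0.16-era hadoop CLI (output omitted):

    # Confirm all 4 datanodes have re-registered and are reporting.
    bin/hadoop dfsadmin -report

    # Re-run fsck periodically; MISSING BLOCKS should drop to 0 and the
    # under-replicated count should shrink as re-replication proceeds.
    bin/hadoop fsck /

    # To even out where replicas sit after swapping a node in or out,
    # run the balancer (shipped with 0.16 and later):
    bin/hadoop balancer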