Date: Thu, 8 May 2008 22:33:56 -0700 (PDT)
From: lohit
Subject: Re: Corrupt HDFS and salvaging data
To: core-user@hadoop.apache.org

Hi Otis,

The namenode keeps location information for every replica of a block. When you run fsck, the namenode checks those replicas: if all replicas of a block are missing, fsck reports the block as missing; otherwise the block is counted as under-replicated. If you pass the -move or -delete option to fsck, files with such missing blocks are moved to /lost+found or deleted, depending on the option.

At what point did you run the fsck command -- was it after the datanodes were stopped?

Also note that running namenode -format deletes the directories specified in dfs.name.dir. If a directory already exists, it asks for confirmation first.

Thanks,
Lohit
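For reference, the commands above look like this in practice -- a minimal sketch; the flag names are from the hadoop CLI of this era, and the config check assumes the stock conf/hadoop-site.xml location:

    # Health report only; nothing on HDFS is modified.
    bin/hadoop fsck /

    # Per-file detail: which blocks each file has and which
    # datanodes currently hold them.
    bin/hadoop fsck / -files -blocks -locations

    # Move files that have missing blocks into /lost+found ...
    bin/hadoop fsck / -move

    # ... or delete such files outright. Both act only on files
    # with missing blocks, not on merely under-replicated ones.
    bin/hadoop fsck / -delete

    # dfs.name.dir is what "namenode -format" wipes, so check
    # where it points before ever re-running a format:
    grep -A 1 dfs.name.dir conf/hadoop-site.xml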
----- Original Message ----
From: Otis Gospodnetic
To: core-user@hadoop.apache.org
Sent: Thursday, May 8, 2008 9:00:34 PM
Subject: Re: Corrupt HDFS and salvaging data

Hi,

Update: it seems fsck reports HDFS as corrupt when a significant-enough number of block replicas is missing (or something like that). fsck reported a corrupt HDFS after I replaced 1 old DN with 1 new DN. After I restarted Hadoop with the old set of DNs, fsck stopped reporting a corrupt HDFS and started reporting a *healthy* HDFS. I'll follow up with a re-balancing question in a separate email.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Otis Gospodnetic
> To: core-user@hadoop.apache.org
> Sent: Thursday, May 8, 2008 11:35:01 PM
> Subject: Corrupt HDFS and salvaging data
>
> Hi,
>
> I have a case of a corrupt HDFS (according to bin/hadoop fsck) and I'm
> trying not to lose the precious data in it. I accidentally ran
> bin/hadoop namenode -format on a *new DN* that I had just added to the
> cluster. Is it possible for that to corrupt HDFS? I also had to
> explicitly kill the DN daemons before that, because bin/stop-all.sh
> didn't stop them for some reason (it always did before).
>
> Is there any way to salvage the data? I have a 4-node cluster with a
> replication factor of 3, though fsck reports lots of under-replicated
> blocks:
>
>  ********************************
>    CORRUPT FILES:   3355
>    MISSING BLOCKS:  3462
>    MISSING SIZE:    17708821225 B
>  ********************************
>  Minimally replicated blocks:   28802 (89.269775 %)
>  Over-replicated blocks:        0 (0.0 %)
>  Under-replicated blocks:       17025 (52.76779 %)
>  Mis-replicated blocks:         0 (0.0 %)
>  Default replication factor:    3
>  Average block replication:     1.7750744
>  Missing replicas:              17025 (29.727087 %)
>  Number of data-nodes:          4
>  Number of racks:               1
>
> The filesystem under path '/' is CORRUPT
>
> What can one do at this point to save the data? If I run bin/hadoop
> fsck -move or -delete, will I lose some of the data? Or will I simply
> end up with fewer block replicas and thus have to force re-balancing
> in order to get back to a "safe" number of replicas?
>
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
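As for the -move / -delete question above: both act only on the files fsck flags as corrupt (files with at least one block that has no replicas left), so they give up on exactly those files; merely under-replicated files are left in place, and the namenode re-replicates them on its own once enough datanodes are back. A minimal sketch of verifying recovery, assuming the 0.16-era hadoop CLI (output omitted):

    # Confirm all 4 datanodes have re-registered and are reporting.
    bin/hadoop dfsadmin -report

    # Re-run fsck periodically; MISSING BLOCKS should drop to 0 and the
    # under-replicated count should shrink as re-replication proceeds.
    bin/hadoop fsck /

    # To even out where replicas sit after swapping a node in or out,
    # run the balancer (shipped with 0.16 and later):
    bin/hadoop balancer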