From: Raghu Angadi
Date: Mon, 10 Aug 2009 15:37:02 -0700
To: common-user@hadoop.apache.org
Subject: Re: corrupt filesystem
Message-ID: <4A80A10E.8010700@yahoo-inc.com>
References: <4A809A0D.7060704@casalemedia.com>
> I had assumed that if a replica became corrupt that it would be replaced
> by a non-corrupt copy.
> Is this not the case?

Yes, it is. Usually some random block gets corrupted for various reasons and is replaced by another replica of the block. A block stays in a corrupt state only if there are no good replicas left or new replicas could not be created. The actual reason might be hardware related (say, a lot of nodes die) or a real software bug.

If you use Hadoop 0.20 or later, you will notice a warning in red on the NameNode front page if some blocks are left with no good replicas, so you don't need to run fsck (which can be costly) each time.

If you are interested, you could trace one of these block IDs in the NameNode log to see what happened to it. We are always eager to hear about irrecoverable errors. Please mention the Hadoop version you are using.

If the data is corrupt (rather than truncated or missing), you can still fetch it by passing the "-ignoreCrc" option to 'fs -get'.

Raghu.

Mayuran Yogarajah wrote:
> Hello all,
>
> What can cause HDFS to become corrupt? I was running some jobs which
> were failing. When I checked the logs I saw that some files were corrupt,
> so I ran 'hadoop fsck /', which
> showed that a few files were corrupt:
>
> /user/data/2009-07-01/165_2009-07-01.log: CORRUPT block
> blk_1697509332927954816
> /user/data/2009-07-21/060_2009-07-21.log: CORRUPT block
> blk_8841160612810933777
> /user/data/2009-07-26/173_2009-07-26.log: CORRUPT block
> blk_-6669973789246139664
>
> I had backups of these files, so what I did was delete them and reload
> them; the file system
> is OK now. What I'm wondering is how these files became corrupt. There
> are 6 nodes in the
> cluster and I have a replication factor of 3.
>
> I had assumed that if a replica became corrupt that it would be replaced
> by a non-corrupt copy.
> Is this not the case?
>
> Would there have been some way to recover the files if I didn't have any
> backups?
>
> Another concern is that I only found out HDFS was corrupt by accident.
> I suppose I should have
> a script run every few minutes to parse the results of 'hadoop fsck /'
> and email if anything becomes
> corrupt. How are people currently handling this?
>
> thank you very much
> M
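
[Editor's note: the periodic-fsck monitoring Mayuran describes, and the "-ignoreCrc" recovery Raghu mentions, could be sketched roughly as below. This is a minimal illustration, not a tested production script: the mail command, the recipient address, and the paths are assumptions, and the exact wording of fsck output can differ between Hadoop versions, so check the output of your own release before relying on the grep pattern.]

```shell
#!/bin/sh
# Sketch of a periodic HDFS health check, meant to be run from cron.
# Assumes `hadoop` is on PATH and a `mail` command is available.

# Pull CORRUPT/MISSING lines out of fsck output read on stdin.
# Kept as a separate function so the parsing can be exercised
# without a live cluster.
corrupt_lines() {
    grep -E 'CORRUPT|MISSING'
}

check_hdfs() {
    report=$(hadoop fsck / 2>/dev/null)
    bad=$(printf '%s\n' "$report" | corrupt_lines)
    if [ -n "$bad" ]; then
        printf '%s\n' "$bad" | mail -s "HDFS corruption detected" admin@example.com
    fi
}

# Recovering the bytes of a file whose CRC no longer matches
# (data corrupt rather than truncated/missing), per Raghu's note:
#   hadoop fs -get -ignoreCrc /user/data/2009-07-01/165_2009-07-01.log /tmp/
```

Running fsck over the whole namespace is expensive on a large cluster, so a cron interval of hours rather than minutes (or simply watching the NameNode front-page warning on 0.20+) may be the better trade-off.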