Subject: Re: corrupt blocks after restart
From: suresh srinivas <srini30005@gmail.com>
To: hdfs-user@hadoop.apache.org
Date: Sat, 19 Feb 2011 00:10:49 -0800
In-Reply-To: <1C845548-B875-475A-843A-5939E3758B18@email.com>

The problem is that replicas for 3609 blocks are not reported to the
namenode. Do you have datanodes in the exclude file? What is the number of
registered nodes before the restart compared to what it is now? Removing
all the datanodes from the exclude file (if there are any) and restarting
the cluster should fix the issue.

On Fri, Feb 18, 2011 at 5:43 PM, Chris Tarnas <cft@email.com> wrote:

> I've hit a data corruption problem in a system we were rapidly loading
> up, and I could really use some pointers on where to look for the root of
> the problem, as well as any possible solutions. I'm running the cdh3b3
> build of Hadoop 0.20.2. I experienced some issues with a client (an HBase
> regionserver) getting an IOException talking to the namenode. I thought
> the namenode might have been resource-starved (maybe not enough RAM). I
> first ran an fsck and the filesystem was healthy, then shut down Hadoop
> (stop-all.sh) to update hadoop-env.sh to allocate more memory to the
> namenode, and then started Hadoop again (start-all.sh).
>
> After starting up the server I ran another fsck, and now the filesystem
> is corrupt and about 1/3 or less of the size it should be. All of the
> datanodes are online, but it is as if they are all incomplete.
>
> I've tried using the previous checkpoint from the secondary namenode, to
> no avail. This is the fsck summary:
>
> blocks of total size 442716 B.Status: CORRUPT
>  Total size:    416302602463 B
>  Total dirs:    7571
>  Total files:   7525
>  Total blocks (validated):      8516 (avg. block size 48884758 B)
>  ********************************
>  CORRUPT FILES:        3343
>  MISSING BLOCKS:       3609
>  MISSING SIZE:         169401218659 B
>  CORRUPT BLOCKS:       3609
>  ********************************
>  Minimally replicated blocks:   4907 (57.62095 %)
>  Over-replicated blocks:        0 (0.0 %)
>  Under-replicated blocks:       4740 (55.659935 %)
>  Mis-replicated blocks:         0 (0.0 %)
>  Default replication factor:    3
>  Average block replication:     0.7557539
>  Corrupt blocks:                3609
>  Missing replicas:              8299 (128.94655 %)
>  Number of data-nodes:          10
>  Number of racks:               1
>
> The namenode had quite a few WARNs like this one (the list of excluded
> nodes is all of the nodes in the system!):
>
> 2011-02-18 17:06:40,506 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Not able to place enough replicas, still in need of 1 (excluded: 10.56.24.15:50010, 10.56.24.19:50010, 10.56.24.16:50010, 10.56.24.20:50010, 10.56.24.14:50010, 10.56.24.17:50010, 10.56.24.13:50010, 10.56.24.18:50010, 10.56.24.11:50010, 10.56.24.12:50010)
>
> I grepped for errors and warns in all 10 of the datanode logs and only
> found that over the last day two nodes had a total of 8 warns and 1
> error:
>
> node 5:
>
> 2011-02-18 03:44:56,642 WARN org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: First Verification failed for blk_-8223286903671115311_101182.
> Exception : java.io.IOException: Input/output error
> 2011-02-18 03:45:04,440 WARN org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Second Verification failed for blk_-8223286903671115311_101182. Exception : java.io.IOException: Input/output error
> 2011-02-18 06:53:17,081 WARN org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: First Verification failed for blk_8689822798201808529_99687. Exception : java.io.IOException: Input/output error
> 2011-02-18 06:53:25,105 WARN org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Second Verification failed for blk_8689822798201808529_99687. Exception : java.io.IOException: Input/output error
> 2011-02-18 12:09:09,613 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Could not read or failed to veirfy checksum for data at offset 25624576 for block blk_-8776727553170755183_302602 got : java.io.IOException: Input/output error
> 2011-02-18 12:17:03,874 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Could not read or failed to veirfy checksum for data at offset 2555904 for block blk_-1372864350494009223_328898 got : java.io.IOException: Input/output error
> 2011-02-18 13:15:40,637 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Could not read or failed to veirfy checksum for data at offset 458752 for block blk_5554094539319851344_322246 got : java.io.IOException: Input/output error
> 2011-02-18 13:12:13,587 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.56.24.15:50010, storageID=DS-1424058120-10.56.24.15-50010-1297226452840, infoPort=50075, ipcPort=50020):DataXceiver
>
> Node 9:
>
> 2011-02-18 12:02:58,879 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Could not read or failed to veirfy checksum for data at offset 16711680 for block blk_-5196887735268731000_300861 got : java.io.IOException: Input/output error
>
> Many thanks for any help, and any pointers on where I should look.
> -chris

--
Regards,
Suresh
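The exclude-file check suggested above can be sketched as a short script. The path below is a stand-in for illustration only; the real file is whatever dfs.hosts.exclude points at in your hdfs-site.xml, and the 0.20-era cluster commands are left as comments since they need a live namenode:

```shell
#!/bin/sh
# Demo path standing in for the real dfs.hosts.exclude file (an
# assumption; check hdfs-site.xml for the actual location).
EXCLUDE_FILE=${EXCLUDE_FILE:-/tmp/dfs.hosts.exclude.demo}

# Stand-in content: two datanodes listed for exclusion.
printf '10.56.24.15\n10.56.24.16\n' > "$EXCLUDE_FILE"

echo "excluded datanodes before: $(wc -l < "$EXCLUDE_FILE")"

# Back the list up, then clear it so those nodes can report their
# blocks again.
cp "$EXCLUDE_FILE" "$EXCLUDE_FILE.bak"
: > "$EXCLUDE_FILE"
echo "excluded datanodes after:  $(wc -l < "$EXCLUDE_FILE")"

# On the live cluster, make the namenode re-read the include/exclude
# lists and then re-check replica state:
#   hadoop dfsadmin -refreshNodes
#   hadoop fsck /
```

-refreshNodes only re-reads the host lists, so it avoids another full restart while the cluster is re-registering datanodes.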
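The fsck figures quoted in the thread can be cross-checked with a few lines of arithmetic (values copied from the report; the interpretation is a sketch):

```python
# Figures copied from the fsck report quoted above.
total_blocks = 8516
missing_blocks = 3609
total_size = 416_302_602_463    # bytes
missing_size = 169_401_218_659  # bytes
avg_replication = 0.7557539
default_replication = 3

# Fraction of blocks and bytes with no reported replica at all.
print(f"blocks missing: {missing_blocks / total_blocks:.1%}")   # 42.4%
print(f"bytes missing:  {missing_size / total_size:.1%}")       # 40.7%

# Fraction of expected replicas actually reported to the namenode.
print(f"replicas reported: {avg_replication / default_replication:.1%}")  # 25.2%
```

With only about a quarter of expected replicas checked in, the pattern fits whole datanode volumes (or whole nodes) failing to report at restart rather than scattered per-block corruption, which is why the exclude file is the first thing to rule out.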