hadoop-hdfs-user mailing list archives

From Chris Tarnas <...@email.com>
Subject corrupt blocks after restart
Date Sat, 19 Feb 2011 01:43:00 GMT
I've hit a data corruption problem in a system we were rapidly loading up, and I could really
use some pointers on where to look for the root of the problem, as well as any possible solutions.
I'm running the CDH3b3 build of Hadoop 0.20.2. I experienced some issues with a client (an HBase
regionserver) getting an IOException while talking to the namenode. I thought the namenode might
have been resource starved (maybe not enough RAM). I first ran an fsck and the filesystem
was healthy. I then shut down Hadoop (stop-all.sh) to update hadoop-env.sh and allocate
more memory to the namenode (the edit is sketched below), then started Hadoop again (start-all.sh).
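
The hadoop-env.sh change was along these lines (the heap value shown here is approximate,
not necessarily the exact figure I used):

  # hadoop-env.sh -- raise the namenode's JVM heap; value approximate
  export HADOOP_NAMENODE_OPTS="-Xmx4096m ${HADOOP_NAMENODE_OPTS}"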

After starting the cluster back up I ran another fsck, and now the filesystem is corrupt and
about a third or less of the size it should be. All of the datanodes are online, but it is as
if they are all incomplete.
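
For reference, both checks were plain whole-filesystem fscks run from the namenode, i.e.
something like:

  hadoop fsck /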

I've tried restoring the previous checkpoint from the secondary namenode, to no avail (the
procedure I used is sketched after the fsck output below). This is the fsck summary:

...blocks of total size 442716 B.
Status: CORRUPT
 Total size:	416302602463 B
 Total dirs:	7571
 Total files:	7525
 Total blocks (validated):	8516 (avg. block size 48884758 B)
  ********************************
  CORRUPT FILES:	3343
  MISSING BLOCKS:	3609
  MISSING SIZE:		169401218659 B
  CORRUPT BLOCKS: 	3609
  ********************************
 Minimally replicated blocks:	4907 (57.62095 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	4740 (55.659935 %)
 Mis-replicated blocks:		0 (0.0 %)
 Default replication factor:	3
 Average block replication:	0.7557539
 Corrupt blocks:		3609
 Missing replicas:		8299 (128.94655 %)
 Number of data-nodes:		10
 Number of racks:		1
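
The checkpoint restore I attempted was the standard -importCheckpoint route, roughly as below
(this assumes fs.checkpoint.dir points at a copy of the secondary namenode's checkpoint
directory):

  # Run on the namenode host with the cluster stopped and dfs.name.dir empty;
  # the namenode then loads the most recent image from fs.checkpoint.dir.
  bin/hadoop namenode -importCheckpoint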

The namenode log had quite a few WARNs like this one (the list of excluded nodes is all of
the nodes in the system!):

2011-02-18 17:06:40,506 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Not able
to place enough replicas, still in need of 1(excluded: 10.56.24.15:50010, 10.56.24.19:50010,
10.56.24.16:50010, 10.56.24.20:50010, 10.56.24.14:50010, 10.56.24.17:50010, 10.56.24.13:50010,
10.56.24.18:50010, 10.56.24.11:50010, 10.56.24.12:50010)

I grepped for errors and warnings in all 10 of the datanode logs (roughly the command sketched
below) and found that over the last day only two nodes had logged anything: a total of 8
warnings and 1 error.
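
The grep was along these lines (the log directory is just where the logs live on our boxes;
yours may differ):

  grep -E 'WARN|ERROR' /usr/lib/hadoop/logs/hadoop-*-datanode-*.log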

Node 5:

2011-02-18 03:44:56,642 WARN org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: First
Verification failed for blk_-8223286903671115311_101182. Exception : java.io.IOException:
Input/output error
2011-02-18 03:45:04,440 WARN org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Second
Verification failed for blk_-8223286903671115311_101182. Exception : java.io.IOException:
Input/output error
2011-02-18 06:53:17,081 WARN org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: First
Verification failed for blk_8689822798201808529_99687. Exception : java.io.IOException: Input/output
error
2011-02-18 06:53:25,105 WARN org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Second
Verification failed for blk_8689822798201808529_99687. Exception : java.io.IOException: Input/output
error
2011-02-18 12:09:09,613 WARN org.apache.hadoop.hdfs.server.datanode.DataNode:  Could not read
or failed to veirfy checksum for data at offset 25624576 for block blk_-8776727553170755183_302602
got : java.io.IOException: Input/output error
2011-02-18 12:17:03,874 WARN org.apache.hadoop.hdfs.server.datanode.DataNode:  Could not read
or failed to veirfy checksum for data at offset 2555904 for block blk_-1372864350494009223_328898
got : java.io.IOException: Input/output error
2011-02-18 13:15:40,637 WARN org.apache.hadoop.hdfs.server.datanode.DataNode:  Could not read
or failed to veirfy checksum for data at offset 458752 for block blk_5554094539319851344_322246
got : java.io.IOException: Input/output error
2011-02-18 13:12:13,587 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.56.24.15:50010,
storageID=DS-1424058120-10.56.24.15-50010-1297226452840, infoPort=50075, ipcPort=50020):DataXceiver

Node 9:

2011-02-18 12:02:58,879 WARN org.apache.hadoop.hdfs.server.datanode.DataNode:  Could not read
or failed to veirfy checksum for data at offset 16711680 for block blk_-5196887735268731000_300861
got : java.io.IOException: Input/output error

Many thanks for any help, or pointers on where I should look.
-chris