hadoop-common-user mailing list archives

From Akshay Singh <akshay_i...@yahoo.com>
Subject Multiple fs.FSInputChecker: Found checksum error .. because of load ?
Date Wed, 23 May 2012 20:38:12 GMT
Hi,

I am trying to run a few benchmarks on a small Hadoop cluster of 4 VMs (2 on each of 2 physical hosts; each VM has 1 CPU core, 2 GB RAM, its own disk, and Gbps bridged connectivity). I am using VirtualBox as the VMM.

This workload concurrently reads a good number of randomly chosen small files (64 MB each) from all the HDFS datanodes, through clients running on the same set of VMs. I am using FsShell cat to read the files (a rough sketch of the read driver follows the log output below), and I see checksum errors like these:

12/05/22 10:10:12 INFO fs.FSInputChecker: Found checksum error: b[3072, 3584]=cb93678dc0259c978731af408f2cb493b510c948b45039a4853688fd21c2a070fc030000ff7b807f000033d20100080027
cf09e308002761d4480800450005dc2af04000400633ca816169cf816169d0c35a87c1b090973e78aa5ef880100e24446b00000101080a020fcf7b020fcea7d85a506ff1eaea5383eea539137745249aebc25e86d0feac89
c4e0c9b91bc09ee146af7e9bd103c8269486a8c748091cfc42e178f461d9127f6c9676f47fa6863bb19f2e51142725ae643ffdfbe7027798e1f11314d9aa877db99a86db25f2f6d18d5b86062de737147b918e829fb178cf
bbb57e932ab082197b1f4fa4315eae67210018c3c034b3f52481c4cebc53d1e2fd5ad4b67d87823f5e0923fa1ff579de88768f79a6df5f86a8a7eb3a68b3366063408b7292eef8f909580e3866676838ba8417bb810d9a9e
8d12c49de4522214e1c6a22b64394a1e60e020b12d5803d2b6a53fe64d00b85dc63c67a8a94758f71a7a06a786e168ea234030806026ffed07770ba6d407437a4a83b96c2b3a3c767d834a19c438a0d6f56ca6fc9099d375
ae1f95839c62f36a466818eb816d4d3ef6f3951ce3a19a3364a827bac8fd70833587c89084b847e4ceeae48df9256ef629c6325f67872478838777885f930710b71c02256b0cc66242d4974fbfb0ebcf85ef6cf4b67656dc
6918bc57083dc8868e34662c98e183163a9fc82a42fddc
org.apache.hadoop.fs.ChecksumException: Checksum error: /blk_2250776182612718654:of:/user/hduser/15-3/part-00197 at 52284416
        at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
        at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
        at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:189)
        at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
        at org.apache.hadoop.hdfs.DFSClient$BlockReader.read(DFSClient.java:1457)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:2172)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2224)
        at java.io.DataInputStream.read(DataInputStream.java:100)
        at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:74)
        at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:47)
        at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:100)
        at org.apache.hadoop.fs.FsShell.printToStdout(FsShell.java:114)
        at org.apache.hadoop.fs.FsShell.access$100(FsShell.java:49)
        at org.apache.hadoop.fs.FsShell$1.process(FsShell.java:349)
        at org.apache.hadoop.fs.FsShell$DelayedExceptionThrowing.globAndProcess(FsShell.java:1913)
        at org.apache.hadoop.fs.FsShell.cat(FsShell.java:346)
        at org.apache.hadoop.fs.FsShell.doall(FsShell.java:1557)
        at org.apache.hadoop.fs.FsShell.run(FsShell.java:1776)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.fs.FsShell.main(FsShell.java:1895)
12/05/22 10:10:13 WARN hdfs.DFSClient: Found Checksum error for blk_2250776182612718654_6078
from XX.XX.XX.207:50010 at 52284416
12/05/22 10:10:13 INFO hdfs.DFSClient: Could not obtain block blk_2250776182612718654_6078
from any node: java.io.IOException: No live nodes contain current block. Will get new
 block locations from namenode and retry...
cat: Checksum error: /blk_2250776182612718654:of:/user/hduser/15-3/part-00197 at 52284416
cat: Checksum error: /blk_-5591790629390980895:of:/user/hduser/15-1/part-00192 at 30324736
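
For reference, each client VM drives the reads roughly like this; the fan-out of 8 and the exact paths are only illustrative, the FsShell cat usage is what I actually run:

    # Launch several concurrent FsShell readers; output is discarded since
    # only the read path matters. FsShell expands the part-* glob itself.
    for i in 1 2 3 4 5 6 7 8; do
      hadoop fs -cat "/user/hduser/15-$i/part-*" > /dev/null &
    done
    wait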

Hadoop fsck does not report any corrupt blocks after writing the data, but after every iteration of reading the data I see new corrupt blocks (with output as above). Interestingly, the higher the load (concurrent sequential reads) I put on the DFS cluster, the higher the chance of blocks getting corrupted. I (mostly) do not see any corruption when there is little or no read contention at the DFS servers.
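
For completeness, the post-write check that comes back clean is along these lines (fsck flags as in the 0.20-era docs):

    # Report block health for the benchmark data right after writing it.
    hadoop fsck /user/hduser -files -blocks -locations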

I see that a few other people on the web have faced the same problem:

http://comments.gmane.org/gmane.comp.jakarta.lucene.hadoop.user/508
http://tinyurl.com/7rsckwo

It has been suggested on these threads that faulty hardware may be causing the issue, and that checksum errors like these usually indicate as much. So I diagnosed my RAM (non-ECC) and HDDs but did not find any problem there; I do not have ECC RAM to try with. What makes me more doubtful about hardware being the culprit is that the same workloads run fine on the same set of physical machines (and more) and do not cause any block corruption. I also tried creating fresh VMs multiple times, but that did not help.
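
If it helps with diagnosis: FsShell's get takes an -ignoreCrc flag (per the FsShell docs), which should let me compare raw bytes across two reads of a flagged file, something like:

    # Fetch the same file twice with client-side checksum verification off.
    # If the copies differ, the corruption is happening in flight (network /
    # virtio path under load); if they match, the bad bytes are on disk.
    hadoop fs -get -ignoreCrc /user/hduser/15-3/part-00197 copy1
    hadoop fs -get -ignoreCrc /user/hduser/15-3/part-00197 copy2
    md5sum copy1 copy2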

Does anybody have any suggestions on this? I am not sure whether the cause is the underpowered VMs, as I only see corruption with VMs, and the corruption increases as I increase the load on the DFS cluster.

Thanks,
Akshay