From: Akshay Singh
Date: Thu, 24 May 2012 04:38:12 +0800 (SGT)
Subject: Multiple fs.FSInputChecker: Found checksum error .. because of load?
To: common-user@hadoop.apache.org

Hi,

I am trying to run a few benchmarks on a small Hadoop cluster of 4 VMs (2 VMs on each of 2 physical hosts; each VM has 1 CPU core, 2 GB RAM, its own disk, and Gbps bridged connectivity). I am using VirtualBox as the VMM.

The workload concurrently reads a good number of random small files (64 MB each) from all the HDFS datanodes, through clients running on the same set of VMs, using FsShell cat.
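The read side is driven roughly like the sketch below (a minimal illustration only; the actual benchmark picks files at random and uses more concurrent readers; the paths are two of the part files named in the errors):

    # fire off several concurrent sequential readers with FsShell cat,
    # discarding the bytes; each part file is 64 MB in HDFS
    for f in /user/hduser/15-3/part-00197 /user/hduser/15-1/part-00192; do
      hadoop fs -cat "$f" > /dev/null &
    done
    wait   # let all readers finish before starting the next iteration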
Under this load I see checksum errors like the following:

12/05/22 10:10:12 INFO fs.FSInputChecker: Found checksum error: b[3072, 3584]=cb93678dc0259c978731af408f2cb493b510c948b45039a4853688fd21c2a070fc030000ff7b807f000033d20100080027
cf09e308002761d4480800450005dc2af04000400633ca816169cf816169d0c35a87c1b090973e78aa5ef880100e24446b00000101080a020fcf7b020fcea7d85a506ff1eaea5383eea539137745249aebc25e86d0feac89
c4e0c9b91bc09ee146af7e9bd103c8269486a8c748091cfc42e178f461d9127f6c9676f47fa6863bb19f2e51142725ae643ffdfbe7027798e1f11314d9aa877db99a86db25f2f6d18d5b86062de737147b918e829fb178cf
bbb57e932ab082197b1f4fa4315eae67210018c3c034b3f52481c4cebc53d1e2fd5ad4b67d87823f5e0923fa1ff579de88768f79a6df5f86a8a7eb3a68b3366063408b7292eef8f909580e3866676838ba8417bb810d9a9e
8d12c49de4522214e1c6a22b64394a1e60e020b12d5803d2b6a53fe64d00b85dc63c67a8a94758f71a7a06a786e168ea234030806026ffed07770ba6d407437a4a83b96c2b3a3c767d834a19c438a0d6f56ca6fc9099d375
ae1f95839c62f36a466818eb816d4d3ef6f3951ce3a19a3364a827bac8fd70833587c89084b847e4ceeae48df9256ef629c6325f67872478838777885f930710b71c02256b0cc66242d4974fbfb0ebcf85ef6cf4b67656dc
6918bc57083dc8868e34662c98e183163a9fc82a42fddc
org.apache.hadoop.fs.ChecksumException: Checksum error: /blk_2250776182612718654:of:/user/hduser/15-3/part-00197 at 52284416
        at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
        at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
        at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:189)
        at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
        at org.apache.hadoop.hdfs.DFSClient$BlockReader.read(DFSClient.java:1457)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:2172)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2224)
        at java.io.DataInputStream.read(DataInputStream.java:100)
        at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:74)
        at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:47)
        at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:100)
        at org.apache.hadoop.fs.FsShell.printToStdout(FsShell.java:114)
        at org.apache.hadoop.fs.FsShell.access$100(FsShell.java:49)
        at org.apache.hadoop.fs.FsShell$1.process(FsShell.java:349)
        at org.apache.hadoop.fs.FsShell$DelayedExceptionThrowing.globAndProcess(FsShell.java:1913)
        at org.apache.hadoop.fs.FsShell.cat(FsShell.java:346)
        at org.apache.hadoop.fs.FsShell.doall(FsShell.java:1557)
        at org.apache.hadoop.fs.FsShell.run(FsShell.java:1776)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.fs.FsShell.main(FsShell.java:1895)
12/05/22 10:10:13 WARN hdfs.DFSClient: Found Checksum error for blk_2250776182612718654_6078 from XX.XX.XX.207:50010 at 52284416
12/05/22 10:10:13 INFO hdfs.DFSClient: Could not obtain block blk_2250776182612718654_6078 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
cat: Checksum error: /blk_2250776182612718654:of:/user/hduser/15-3/part-00197 at 52284416
cat: Checksum error: /blk_-5591790629390980895:of:/user/hduser/15-1/part-00192 at 30324736

Hadoop fsck reports no corrupt blocks right after the data is written, but after every iteration of reading the data I see new corrupt blocks (with output like the above). Interestingly, the higher the load (more concurrent sequential reads) I put on the DFS cluster, the more likely blocks are to become corrupted; I mostly see no corruption at all when there is little or no read contention on the DFS servers.

A few other people on the web have hit the same problem:

http://comments.gmane.org/gmane.comp.jakarta.lucene.hadoop.user/508
http://tinyurl.com/7rsckwo

Those threads suggest that faulty hardware may be the cause, and that checksum errors usually point that way. So I tested my (non-ECC) RAM and my HDDs, but found no problems there; I do not have ECC RAM to try instead. What makes me doubt the hardware explanation even more is that the same workloads run fine on the same set of physical machines (and others) without corrupting any blocks. I also tried recreating the VMs from scratch several times, but that did not help.

Does anybody have any suggestions? I am not sure whether the underpowered VMs are the reason, but I only see corruption with the VMs, and the corruption increases as I increase the load on the DFS cluster.

Thanks,
Akshay
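P.S. The fsck check I run between passes is essentially the stock report; a sketch (exact flags and path may differ from my runs, /user/hduser is the benchmark's output directory):

    # expect "Status: HEALTHY" right after the write phase
    hadoop fsck /user/hduser
    # after a read pass, list any blocks now reported corrupt
    hadoop fsck /user/hduser -files -blocks -locations | grep -i corrupt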