From: "Christian Kunz (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] Commented: (HDFS-732) HDFS files are ending up truncated
Date: Sun, 25 Oct 2009 01:16:59 +0000 (UTC)

    [ https://issues.apache.org/jira/browse/HDFS-732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12769743#action_12769743 ]

Christian Kunz commented on HDFS-732:
-------------------------------------

The file should have had a size of 18654752, computed by adding up all the bytes successfully written, but it ended up with only 17825792. I verified that the missing data was at the end, i.e. the block was somehow recovered from an incomplete replica.
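For reference, the check amounts to comparing the writer's own byte count against the length the namenode reports. A minimal sketch using the standard FileSystem API (illustrative only: the application actually writes via libhdfs; the class name is made up, and the path is the anonymized one from the namenode log below):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class VerifyLength {
    public static void main(String[] args) throws Exception {
        // The application's own count of bytes successfully written.
        long expected = 18654752L;
        // Anonymized path taken from the namenode log below.
        Path file = new Path("dir/output/app/9110/filename");

        FileSystem fs = FileSystem.get(new Configuration());
        // Length as recorded by the namenode; here it came back 17825792.
        long actual = fs.getFileStatus(file).getLen();
        if (actual != expected) {
            System.err.println("TRUNCATED: expected " + expected
                + " bytes, found " + actual);
        }
        fs.close();
    }
}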
Here are the relevant log messages from the namenode:

grep blk_6703874482275767879 hadoop-user-namenode-host.log.2009-10-23

2009-10-23 19:46:47,815 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: dir/output/app/_temporary/_attempt_200910221954_0001_r_009110_0/9110/filename. blk_6703874482275767879_76799972
2009-10-23 21:16:00,028 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(lastblock=blk_6703874482275767879_76799972, newgenerationstamp=76840998, newlength=17825792, newtargets=[xxx.yyy.zzz.56:uuu10, xxx.yyy.zzz.44:uuu10], closeFile=false, deleteBlock=false)
2009-10-23 21:16:00,028 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(blk_6703874482275767879_76840998) successful
2009-10-23 21:16:00,397 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(lastblock=blk_6703874482275767879_76840998, newgenerationstamp=76840999, newlength=17825792, newtargets=[xxx.yyy.zzz.44:uuu10], closeFile=false, deleteBlock=false)
2009-10-23 21:16:00,397 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(blk_6703874482275767879_76840999) successful
2009-10-23 22:16:02,159 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* blk_6703874482275767879_76840999 recovery started, primary=xxx.yyy.zzz.44:uuu10
2009-10-23 22:16:02,925 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: addStoredBlock request received for blk_6703874482275767879_76888761 on xxx.yyy.zzz.44:uuu10 size 17825792 But it does not belong to any file.
2009-10-23 22:16:02,925 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(lastblock=blk_6703874482275767879_76840999, newgenerationstamp=76888761, newlength=17825792, newtargets=[xxx.yyy.zzz.44:uuu10], closeFile=true, deleteBlock=false)
2009-10-23 22:16:02,931 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(newblock=blk_6703874482275767879_76888761, file=dir/output/app/9110/filename, newgenerationstamp=76888761, newlength=17825792, newtargets=[xxx.yyy.zzz.44:uuu10]) successful
2009-10-23 22:16:05,112 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask xxx.yyy.zzz.44:uuu10 to replicate blk_6703874482275767879_76888761 to datanode(s) xxx.yyy.zzz.51:uuu10 xxx.yyy.zzz.237:uuu10
2009-10-23 22:16:06,310 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: xxx.yyy.zzz.51:uuu10 is added to blk_6703874482275767879_76888761 size 17825792
2009-10-23 22:16:06,312 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: xxx.yyy.zzz.237:uuu10 is added to blk_6703874482275767879_76888761 size 17825792

Relevant log messages of the task (the block went through recovery):

09/10/23 21:15:59 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_6703874482275767879_76799972 java.io.IOException: Connection reset by peer
09/10/23 21:15:59 WARN hdfs.DFSClient: Error Recovery for block blk_6703874482275767879_76799972 bad datanode[0] xxx.yyy.zzz.43:uuu10
09/10/23 21:15:59 WARN hdfs.DFSClient: Error Recovery for block blk_6703874482275767879_76799972 in pipeline xxx.yyy.zzz.43:uuu10, xxx.yyy.zzz.56:uuu10, xxx.yyy.zzz.44:uuu10: bad datanode xxx.yyy.zzz.43:uuu10
09/10/23 21:16:00 WARN hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Connection reset by peer
09/10/23 21:16:00 WARN hdfs.DFSClient: Error Recovery for block blk_6703874482275767879_76840998 bad datanode[0] xxx.yyy.zzz.56:uuu10
09/10/23 21:16:00 WARN hdfs.DFSClient: Error Recovery for block blk_6703874482275767879_76840998 in pipeline xxx.yyy.zzz.56:uuu10, xxx.yyy.zzz.44:uuu10: bad datanode xxx.yyy.zzz.56:uuu10
09/10/23 21:16:00 WARN hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Connection reset by peer
09/10/23 21:16:00 WARN hdfs.DFSClient: Error Recovery for block blk_6703874482275767879_76840999 bad datanode[0] xxx.yyy.zzz.44:uuu10
Exception in thread "main" java.io.IOException: All datanodes xxx.yyy.zzz.44:uuu10 are bad. Aborting..

The first attempt to close the file was unsuccessful, but the second attempt succeeded, albeit with the truncated size.
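In client-side terms, the failure mode reduces to the pattern sketched below (illustrative only, not the job's actual code; the application writes through libhdfs, whose hdfsCloseFile ends up on the same close() path). The dangerous part is that the retried close() reports success even though block recovery has already committed the shorter 17825792-byte replica, so the writer never learns that data was lost:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;

public class CloseRetry {
    // Close with a single retry, as the task above effectively did.
    public static void closeWithRetry(FSDataOutputStream out) throws IOException {
        try {
            out.close();   // 1st attempt: failed ("All datanodes ... are bad")
        } catch (IOException firstFailure) {
            out.close();   // 2nd attempt: succeeded, but the file was truncated
        }
    }
}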
> HDFS files are ending up truncated
> ----------------------------------
>
>                 Key: HDFS-732
>                 URL: https://issues.apache.org/jira/browse/HDFS-732
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.20.1
>            Reporter: Christian Kunz
>
> We recently started to use hadoop-0.20.1 in our production environment (less than 2 weeks ago) and have already had 3 instances of truncated files, more than we had in months of using hadoop-0.18.3.
> Writing is done using libhdfs, although the problem rather seems to be on the server side.
> I will post some relevant logs (they are too large to put into the description).