From: "Christian Kunz (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] Commented: (HDFS-732) HDFS files are ending up truncated
Date: Sun, 25 Oct 2009 01:16:59 +0000 (UTC)

    [ https://issues.apache.org/jira/browse/HDFS-732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12769743#action_12769743 ]

Christian Kunz commented on HDFS-732:
-------------------------------------

The file should have had a size of 18654752, computed by adding up all the bytes successfully written, but it ended up with only 17825792. I verified that the missing data was at the end, i.e. the block was somehow recovered from an incomplete replica.
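For reference, the check amounts to comparing the writer's own byte count against the length the namenode reports. A minimal sketch using the standard FileSystem API (illustrative only: the application actually writes via libhdfs; the class name is made up, and the path is the anonymized one from the namenode log below):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class VerifyLength {
    public static void main(String[] args) throws Exception {
        // The application's own count of bytes successfully written.
        long expected = 18654752L;
        // Anonymized path taken from the namenode log below.
        Path file = new Path("dir/output/app/9110/filename");

        FileSystem fs = FileSystem.get(new Configuration());
        // Length as recorded by the namenode; here it came back 17825792.
        long actual = fs.getFileStatus(file).getLen();
        if (actual != expected) {
            System.err.println("TRUNCATED: expected " + expected
                + " bytes, found " + actual);
        }
        fs.close();
    }
}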
Here are the relevant log messages from the namenode:

grep blk_6703874482275767879 hadoop-user-namenode-host.log.2009-10-23

2009-10-23 19:46:47,815 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: dir/output/app/_temporary/_attempt_200910221954_0001_r_009110_0/9110/filename. blk_6703874482275767879_76799972
2009-10-23 21:16:00,028 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(lastblock=blk_6703874482275767879_76799972, newgenerationstamp=76840998, newlength=17825792, newtargets=[xxx.yyy.zzz.56:uuu10, xxx.yyy.zzz.44:uuu10], closeFile=false, deleteBlock=false)
2009-10-23 21:16:00,028 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(blk_6703874482275767879_76840998) successful
2009-10-23 21:16:00,397 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(lastblock=blk_6703874482275767879_76840998, newgenerationstamp=76840999, newlength=17825792, newtargets=[xxx.yyy.zzz.44:uuu10], closeFile=false, deleteBlock=false)
2009-10-23 21:16:00,397 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(blk_6703874482275767879_76840999) successful
2009-10-23 22:16:02,159 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* blk_6703874482275767879_76840999 recovery started, primary=xxx.yyy.zzz.44:uuu10
2009-10-23 22:16:02,925 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: addStoredBlock request received for blk_6703874482275767879_76888761 on xxx.yyy.zzz.44:uuu10 size 17825792 But it does not belong to any file.
2009-10-23 22:16:02,925 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(lastblock=blk_6703874482275767879_76840999, newgenerationstamp=76888761, newlength=17825792, newtargets=[xxx.yyy.zzz.44:uuu10], closeFile=true, deleteBlock=false)
2009-10-23 22:16:02,931 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(newblock=blk_6703874482275767879_76888761, file=dir/output/app/9110/filename, newgenerationstamp=76888761, newlength=17825792, newtargets=[xxx.yyy.zzz.44:uuu10]) successful
2009-10-23 22:16:05,112 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask xxx.yyy.zzz.44:uuu10 to replicate blk_6703874482275767879_76888761 to datanode(s) xxx.yyy.zzz.51:uuu10 xxx.yyy.zzz.237:uuu10
2009-10-23 22:16:06,310 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: xxx.yyy.zzz.51:uuu10 is added to blk_6703874482275767879_76888761 size 17825792
2009-10-23 22:16:06,312 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: xxx.yyy.zzz.237:uuu10 is added to blk_6703874482275767879_76888761 size 17825792

Relevant log messages of the task (the block went through recovery):

09/10/23 21:15:59 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_6703874482275767879_76799972 java.io.IOException: Connection reset by peer
09/10/23 21:15:59 WARN hdfs.DFSClient: Error Recovery for block blk_6703874482275767879_76799972 bad datanode[0] xxx.yyy.zzz.43:uuu10
09/10/23 21:15:59 WARN hdfs.DFSClient: Error Recovery for block blk_6703874482275767879_76799972 in pipeline xxx.yyy.zzz.43:uuu10, xxx.yyy.zzz.56:uuu10, xxx.yyy.zzz.44:uuu10: bad datanode xxx.yyy.zzz.43:uuu10
09/10/23 21:16:00 WARN hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Connection reset by peer
09/10/23 21:16:00 WARN hdfs.DFSClient: Error Recovery for block blk_6703874482275767879_76840998 bad datanode[0] xxx.yyy.zzz.56:uuu10
09/10/23 21:16:00 WARN hdfs.DFSClient: Error Recovery for block blk_6703874482275767879_76840998 in pipeline xxx.yyy.zzz.56:uuu10, xxx.yyy.zzz.44:uuu10: bad datanode xxx.yyy.zzz.56:uuu10
09/10/23 21:16:00 WARN hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Connection reset by peer
09/10/23 21:16:00 WARN hdfs.DFSClient: Error Recovery for block blk_6703874482275767879_76840999 bad datanode[0] xxx.yyy.zzz.44:uuu10
Exception in thread "main" java.io.IOException: All datanodes xxx.yyy.zzz.44:uuu10 are bad. Aborting..

The first attempt to close the file was unsuccessful, but the second attempt succeeded, albeit with the truncated size.
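In client-side terms, the failure mode reduces to the pattern sketched below (illustrative only, not the job's actual code; the application writes through libhdfs, whose hdfsCloseFile ends up on the same close() path). The dangerous part is that the retried close() reports success even though block recovery has already committed the shorter 17825792-byte replica, so the writer never learns that data was lost:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;

public class CloseRetry {
    // Close with a single retry, as the task above effectively did.
    public static void closeWithRetry(FSDataOutputStream out) throws IOException {
        try {
            out.close();   // 1st attempt: failed ("All datanodes ... are bad")
        } catch (IOException firstFailure) {
            out.close();   // 2nd attempt: succeeded, but the file was truncated
        }
    }
}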
> HDFS files are ending up truncated
> ----------------------------------
>
>                 Key: HDFS-732
>                 URL: https://issues.apache.org/jira/browse/HDFS-732
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.20.1
>            Reporter: Christian Kunz
>
> We recently started to use hadoop-0.20.1 in our production environment (less than 2 weeks ago) and have already had 3 instances of truncated files, more than we had in months of using hadoop-0.18.3.
> Writing is done using libhdfs, although the problem rather seems to be on the server side.
> I will post some relevant logs (they are too large to put into the description).