hadoop-hdfs-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From David Rosenstrauch <dar...@darose.net>
Subject HDFS file disappeared
Date Thu, 11 Nov 2010 17:31:55 GMT
Had a really peculiar thing happen today:  a file that a job of mine 
created on HDFS seems to have disappeared, and I'm scratching my head as 
to how this could have happened without any errors getting thrown.

I ran a M/R job that created a big bunch of files.  Job completed 
without errors, and it seems like file got created correctly.  However 
when another automated process went to use the file later, it failed 
because the file wasn't there.

When I Sherlock Holmes my way through various log files, I see the 

1) M/R job task says that it successfully created the file:

2010-11-10 21:42:28,356 INFO <our package 
name>.StateFileGeneratorOutputFormat: Data from partition #1 being 
written to file hdfs://<our name node name>:9000<our root 

2) Name node also says that it created the file:

[root@hdmaster hadoop-0.20]# grep 2010.11.10-21.05.29 
hadoop-hadoop-namenode-hdmaster.log.2010-11-10 | grep 
2010-11-10 21:42:28,442 INFO 
ugi=root,root,bin,daemon,sys,adm,disk,wheel     ip=/<our data node> 
   cmd=create      src=<our root 

  dst=null        perm=root:supergroup:rw-r--r--
2010-11-10 21:42:29,802 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
NameSystem.allocateBlock: <our root 

2010-11-10 21:42:30,360 INFO org.apache.hadoop.hdfs.StateChange: 
Removing lease on  file <our root 

from client DFSClient_attempt_201010221550_0418_r_000001_0
2010-11-10 21:42:30,360 INFO org.apache.hadoop.hdfs.StateChange: DIR* 
NameSystem.completeFile: file <our root 

is closed by DFSClient_attempt_201010221550_0418_r_000001_0

2a) Name node does *not* say in the logs anywhere that it deleted the 
file.  (i.e., the above 4 log lines are the only lines in the name node 
log containing this file name)

3) Data node says that it created the file ...

[root@hddata09 hadoop-0.20]# grep 4237880568969698703_13404582 
2010-11-10 21:42:29,827 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block 
blk_-4237880568969698703_13404582 src: /<our data node>:43883 dest: 
/<our data node>:50010
2010-11-10 21:42:30,354 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /<our 
data node>:43883, dest: /<our data node>:50010, bytes: 326522, op: 
HDFS_WRITE, cliID: DFSClient_attempt_201010221550_0418_r_000001_0, 
offset: 0, srvID: DS-229869750-<our data node>-50010-1286316976632, 
blockid: blk_-4237880568969698703_13404582, duration: 150566000
2010-11-10 21:42:30,354 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for 
block blk_-4237880568969698703_13404582 terminating

4) ... but then it deletes it a few seconds later:

2010-11-10 21:42:34,831 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block 
blk_-4237880568969698703_13404582 file 
for deletion
2010-11-10 21:42:34,832 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block 
blk_-4237880568969698703_13404582 at file 

Note that a) no errors were thrown anywhere about this file or block, 
and b) there's nothing in the name node log indicating that the file was 
explicitly deleted.

I'm scratching my head as to how this could have happened.  Anyone have 
any ideas / further debugging suggestions?



View raw message