hadoop-common-dev mailing list archives

From Enis Soztutar <enis.soz.nu...@gmail.com>
Subject data loss after power recovery
Date Mon, 12 Nov 2007 15:15:17 GMT
Hi,

After a serious power failure on our cluster running 0.13.0, we were 
able to restore our previous state. However, we have found that a 
nontrivial number of blocks are missing. It seems that the namenode 
requested deletion of all the blocks kept on one specific machine, and 
every other replica of those blocks was deleted as well. To clarify, for 
some reason the namenode deleted all the blocks on that machine along 
with all the other replicas of those blocks. Does anyone know 
what might have happened? Is this a bug that we should seriously 
consider fixing, or has it already been fixed?

The datanode that caused the data loss was 192.168.15.233. It was first 
started as a slave, then removed to add a new hard disk, and then added 
back to the cluster.

Below are the relevant logs:

Namenode:

2007-11-11 19:15:11,564 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.15.233:50010
2007-11-11 19:15:12,094 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.registerDatanode:
node registration from 192.168.15.231:50010 storage DS1698199061
...
2007-11-11 19:26:49,654 INFO org.apache.hadoop.dfs.StateChange: STATE*
SafeModeInfo.leave: Safe mode is OFF.
...
2007-11-11 19:26:49,654 INFO org.apache.hadoop.dfs.StateChange: STATE* Network topology has
1 racks and 36 datanodes
2007-11-11 19:26:49,654 INFO org.apache.hadoop.dfs.StateChange: STATE* UnderReplicatedBlocks
has 56 blocks
2007-11-11 19:30:05,782 INFO org.apache.hadoop.fs.FSNamesystem: Roll Edit Log
2007-11-11 19:30:40,469 INFO org.apache.hadoop.fs.FSNamesystem: Roll FSImage
2007-11-11 19:31:29,913 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.registerDatanode:
node registration from 192.168.15.236:50010 storage DS1183829041
...
2007-11-11 19:45:03,483 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.heartbeatCheck:
lost heartbeat from 192.168.15.233:50010
2007-11-11 19:45:03,734 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/192.168.15.233:50010
...
2007-11-11 19:45:46,123 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/192.168.15.233:50010
2007-11-11 19:45:46,123 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.15.233:50010
...


Here are example logs for one of the missing blocks, blk_8859727972037265136:

on 192.168.15.203
2007-11-11 19:53:53,755 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_8859727972037265136
file /data/hadoop/dfs/data/current/subdir63/subdir63/subdir63/subdir63/subdir49/blk_8859727972037265136

on 192.168.15.225
2007-11-11 20:18:07,964 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_8859727972037265136
file /data2/hadoop/dfs/data/current/subdir11/subdir63/blk_8859727972037265136

on 192.168.15.233
2007-11-11 19:54:56,078 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_8859727972037265136
file /data/hadoop/dfs/data/current/subdir36/subdir47/blk_8859727972037265136
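
For anyone who wants to trace other missing blocks the same way, here is a minimal sketch in Python. It assumes the datanode logs use the "Deleting block" line format shown above and that the logs from each machine have been collected separately. The function name find_deletions and the host-to-lines input shape are hypothetical and not part of Hadoop.

```python
import re
from collections import defaultdict

# Matches the first line of a datanode deletion entry as it appears above, e.g.
# "2007-11-11 19:53:53,755 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_..."
DELETE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) INFO "
    r"org\.apache\.hadoop\.dfs\.DataNode: Deleting block (blk_-?\d+)"
)


def find_deletions(logs_by_host):
    """Return {block_id: [(timestamp, host), ...]} sorted by timestamp.

    logs_by_host is a hypothetical input shape: host name -> iterable of log lines.
    """
    deletions = defaultdict(list)
    for host, lines in logs_by_host.items():
        for line in lines:
            match = DELETE_RE.match(line)
            if match:
                timestamp, block_id = match.groups()
                deletions[block_id].append((timestamp, host))
    # The timestamp format sorts correctly as a plain string.
    for events in deletions.values():
        events.sort()
    return dict(deletions)


# Sample input built from the three deletion lines quoted above.
PREFIX = " INFO org.apache.hadoop.dfs.DataNode: Deleting block "
BLOCK = "blk_8859727972037265136"
sample_logs = {
    "192.168.15.203": ["2007-11-11 19:53:53,755" + PREFIX + BLOCK],
    "192.168.15.225": ["2007-11-11 20:18:07,964" + PREFIX + BLOCK],
    "192.168.15.233": ["2007-11-11 19:54:56,078" + PREFIX + BLOCK],
}

for timestamp, host in find_deletions(sample_logs)[BLOCK]:
    print(timestamp, host)
```

Sorting the events by timestamp makes it easy to see which replica was deleted first and how long the remaining replicas survived.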

and the log for 192.168.15.233 (excerpted) is:

... 
2007-11-11 20:03:37,789 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_3987170016844853189
file /data/hadoop/dfs/data/current/subdir38/blk_3987170016844853189
2007-11-11 20:03:37,807 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_4414509271638104493
file /data/hadoop/dfs/data/current/subdir56/subdir40/blk_4414509271638104493
2007-11-11 20:03:37,807 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_4651660909902273726
file /data/hadoop/dfs/data/current/subdir32/subdir3/blk_4651660909902273726
2007-11-11 20:03:37,808 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_5189049009734931732
file /data/hadoop/dfs/data/current/subdir56/subdir42/blk_5189049009734931732
2007-11-11 20:03:37,808 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_5395031642694782019
file /data/hadoop/dfs/data/current/subdir41/subdir31/blk_5395031642694782019
2007-11-11 20:03:37,808 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_5567722351418795177
file /data/hadoop/dfs/data/current/subdir56/subdir42/blk_5567722351418795177
2007-11-11 20:03:37,808 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_5592463115430469494
file /data/hadoop/dfs/data/current/subdir10/subdir48/blk_5592463115430469494
... (for all blocks in the datanode)

2007-11-11 20:03:42,941 WARN org.apache.hadoop.dfs.DataNode: Unexpected error trying to delete
block blk_-9219752334498294080. Block not found in blockMap.
2007-11-11 20:03:42,941 WARN org.apache.hadoop.dfs.DataNode: Unexpected error trying to delete
block blk_-9217018193785551154. Block not found in blockMap.
2007-11-11 20:03:42,942 WARN org.apache.hadoop.dfs.DataNode: Unexpected error trying to delete
block blk_-9211664991594450527. Block not found in blockMap.
2007-11-11 20:03:42,942 WARN org.apache.hadoop.dfs.DataNode: Unexpected error trying to delete
block blk_-9211471391608631351. Block not found in blockMap.
2007-11-11 20:03:42,942 WARN org.apache.hadoop.dfs.DataNode: Unexpected error trying to delete
block blk_-9208445774532268187. Block not found in blockMap.
2007-11-11 20:03:42,942 WARN org.apache.hadoop.dfs.DataNode: Unexpected error trying to delete
block blk_-9202539319669633125. Block not found in blockMap.
...



Thanks in advance. 
Enis Soztutar










