hadoop-mapreduce-user mailing list archives

From Suresh Srinivas <sur...@hortonworks.com>
Subject Re: data loss after cluster wide power loss
Date Mon, 01 Jul 2013 22:16:57 GMT
Yes this is a known issue.

The HDFS part of this was addressed in
https://issues.apache.org/jira/browse/HDFS-744 for 2.0.2-alpha and is not
available in the 1.x releases. I think HBase does not use this API yet.


On Mon, Jul 1, 2013 at 3:00 PM, Dave Latham <latham@davelink.net> wrote:

> We're running HBase over HDFS 1.0.2 on about 1000 nodes.  On Saturday the
> data center we were in had a total power failure and the cluster went down
> hard.  When we brought it back up, HDFS reported 4 files as CORRUPT.  We
> recovered the data in question from our secondary datacenter, but I'm
> trying to understand what happened and whether this is a bug in HDFS that
> should be fixed.
>
> From what I can tell the file was created and closed by the dfs client
> (hbase).  Then HBase renamed it into a new directory and deleted some other
> files containing the same data.  Then the cluster lost power.  After the
> cluster was restarted, the datanodes reported into the namenode but the
> blocks for this file appeared as "blocks being written" - the namenode
> rejected them and the datanodes deleted the blocks.  At this point there
> were no replicas for the blocks and the files were marked CORRUPT.  The
> underlying file systems are ext3.  Some questions that I would love to get
> answers to, if anyone with a deeper understanding of HDFS can chime in:
>
>  - Is this a known scenario where data loss is expected?  (I found
> HDFS-1539 but that seems different)
>  - When are blocks moved from blocksBeingWritten to current?  Does that
> happen before a file close operation is acknowledged to an HDFS client?
>  - Could it be that the DataNodes actually moved the blocks to current but
> after the restart ext3 rewound state somehow (forgive my ignorance of
> underlying file system behavior)?
>  - Is there any other explanation for how this can happen?
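> (To illustrate the ext3 question above: a hypothetical sketch of a
> crash-safe write-then-rename sequence. The rename is a journaled metadata
> operation and can survive a crash even when the file's data pages were
> still in the page cache, so the data must be synced before the rename,
> and the parent directory synced after it; names and paths are made up:)

```python
import os
import tempfile

d = tempfile.mkdtemp()
tmp = os.path.join(d, ".tmp_file")
final = os.path.join(d, "final_file")

# 1. Write the data and make it durable before exposing it under its
#    final name.
fd = os.open(tmp, os.O_WRONLY | os.O_CREAT, 0o644)
os.write(fd, b"block data")
os.fsync(fd)                 # data durable before the rename
os.close(fd)

# 2. Atomically move the file into place (metadata-only operation).
os.rename(tmp, final)

# 3. Sync the parent directory so the rename itself is durable.
dfd = os.open(d, os.O_RDONLY)
os.fsync(dfd)
os.close(dfd)
```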
>
> Here is a sequence of selected relevant log lines from the RS (HBase
> Region Server) NN (NameNode) and DN (DataNode - 1 example of 3 in
> question).  It includes everything that mentions the block in question in
> the NameNode and one DataNode log.  Please let me know if there is more
> information that would be helpful.
>
> RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils:
> Creating
> file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
> with permission=rwxrwxrwx
> NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.allocateBlock:
> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c.
> blk_1395839728632046111_357084589
> DN 2013-06-29 11:16:06,832 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
> blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: /
> 10.0.5.237:50010
> NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is added to
> blk_1395839728632046111_357084589 size 25418340
> NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is added to
> blk_1395839728632046111_357084589 size 25418340
> NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is added to
> blk_1395839728632046111_357084589 size 25418340
> DN 2013-06-29 11:16:11,385 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Received block
> blk_1395839728632046111_357084589 of size 25418340 from /10.0.5.237:14327
> DN 2013-06-29 11:16:11,385 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for
> block blk_1395839728632046111_357084589 terminating
> NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
> Removing lease on  file
> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
> from client DFSClient_hb_rs_hs745,60020,1372470111932
> NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: DIR*
> NameSystem.completeFile: file
> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
> is closed by DFSClient_hb_rs_hs745,60020,1372470111932
> RS 2013-06-29 11:16:11,393 INFO
> org.apache.hadoop.hbase.regionserver.Store: Renaming compacted file at
> hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
> to
> hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc30af6e64e56ba5a539fdf159c4c
> RS 2013-06-29 11:16:11,505 INFO
> org.apache.hadoop.hbase.regionserver.Store: Completed major compaction of 7
> file(s) in n of
> users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e. into
> 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for store is 24.2m
>
> -------  CRASH, RESTART ---------
>
> NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addStoredBlock: addStoredBlock request received for
> blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112 but was
> rejected: Reported as block being written but is a block of closed file.
> NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addToInvalidates: blk_1395839728632046111 is added to invalidSet
> of 10.0.6.1:50010
> NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addStoredBlock: addStoredBlock request received for
> blk_1395839728632046111_357084589 on 10.0.5.237:50010 size 16971264 but
> was rejected: Reported as block being written but is a block of closed file.
> NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addToInvalidates: blk_1395839728632046111 is added to invalidSet
> of 10.0.5.237:50010
> NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addStoredBlock: addStoredBlock request received for
> blk_1395839728632046111_357084589 on 10.0.6.24:50010 size 21913088 but
> was rejected: Reported as block being written but is a block of closed file.
> NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addToInvalidates: blk_1395839728632046111 is added to invalidSet
> of 10.0.6.24:50010
> (Note: the clock on the server running the DN is wrong after the restart.
> I believe its timestamps are off by 6 hours.)
> DN 2013-06-29 06:07:22,877 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block
> blk_1395839728632046111_357084589 file
> /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111 for
> deletion
> DN 2013-06-29 06:07:24,952 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block
> blk_1395839728632046111_357084589 at file
> /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
>
>
> Thanks,
> Dave
>



-- 
http://hortonworks.com/download/
