hadoop-hdfs-issues mailing list archives

From "dhruba borthakur (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-1232) Corrupted block if a crash happens before writing to checksumOut but after writing to dataOut
Date Fri, 10 Sep 2010 07:09:33 GMT

    [ https://issues.apache.org/jira/browse/HDFS-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907914#action_12907914 ]

dhruba borthakur commented on HDFS-1232:
----------------------------------------

It actually is a bug, because we do not write to the data file and the metadata file
atomically. This matters especially when we re-open a file for append and start writing
to the last block just as a power failure hits the entire cluster. In that case, after
a restart, none of the three replicas might match its checksum.
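To make the window concrete, here is a minimal sketch in Java of that non-atomic
ordering. The class layout and CRC encoding are illustrative, not the actual
BlockReceiver code; only the dataOut/checksumOut names come from the issue itself.

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.util.zip.CRC32;

    public class PacketWriteSketch {
        private final OutputStream dataOut;      // block file, e.g. blk_N
        private final OutputStream checksumOut;  // metadata file, e.g. blk_N.meta

        public PacketWriteSketch(String blockFile, String metaFile) throws IOException {
            this.dataOut = new FileOutputStream(blockFile, true);    // append: re-opened block
            this.checksumOut = new FileOutputStream(metaFile, true); // append
        }

        // The two writes below are not atomic. A crash after dataOut.write()
        // but before checksumOut.write() leaves the block file ahead of its
        // checksum file, so the replica later fails verification.
        public void receivePacket(byte[] data) throws IOException {
            // (in the real pipeline the packet is first forwarded downstream)
            dataOut.write(data);                 // step 1: block file updated
            // <-- a power failure here produces a "corrupt" replica
            CRC32 crc = new CRC32();
            crc.update(data, 0, data.length);
            int v = (int) crc.getValue();
            checksumOut.write(new byte[] {       // step 2: checksum file updated
                (byte) (v >>> 24), (byte) (v >>> 16), (byte) (v >>> 8), (byte) v
            });
        }
    }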

> Corrupted block if a crash happens before writing to checksumOut but after writing to dataOut
> ---------------------------------------------------------------------------------------------
>
>                 Key: HDFS-1232
>                 URL: https://issues.apache.org/jira/browse/HDFS-1232
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: data-node
>    Affects Versions: 0.20.1
>            Reporter: Thanh Do
>
> - Summary: block is corrupted if a crash happens before writing to checksumOut but
> after writing to dataOut. 
>  
> - Setup:
> + # available datanodes = 1
> + # disks / datanode = 1
> + # failures = 1
> + failure type = crash
> + when/where failure happens = (see below)
>  
> - Details:
> The order of processing a packet during a client write/append at the datanode
> is: first forward the packet downstream, then write the data to the block file,
> and finally write to the checksum file. Hence, if a crash happens BEFORE the
> write to the checksum file but AFTER the write to the data file, the block is
> corrupted (a sketch of this follows the quoted report). Worse, if this is the
> only available replica, the block is lost.
>  
> We also found this problem in the case where there are 3 replicas for a particular
> block and, during an append, two failures occur (see HDFS-1231).
> This bug was found by our Failure Testing Service framework:
> http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
> For questions, please email us: Thanh Do (thanhdo@cs.wisc.edu) and 
> Haryadi Gunawi (haryadi@eecs.berkeley.edu)
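
For completeness, a sketch of why the leftover replica then reads as corrupt:
after a restart, the recomputed CRC of the block data no longer matches what was
persisted before the crash. This is a hypothetical single-chunk simplification;
the real meta file carries a short header plus one CRC per checksum chunk
(512 bytes by default), and this is not the datanode's actual verification code.

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.Arrays;
    import java.util.zip.CRC32;

    public class VerifySketch {
        // Recompute the CRC of the block file and compare it with the stored
        // checksum; simplified to a single whole-file chunk.
        static boolean blockMatchesChecksum(String blockFile, String metaFile)
                throws IOException {
            byte[] data;
            byte[] stored = new byte[4];
            try (FileInputStream in = new FileInputStream(blockFile)) {
                data = in.readAllBytes();        // bytes that survived the crash
            }
            try (FileInputStream in = new FileInputStream(metaFile)) {
                if (in.read(stored) != 4) {
                    return false;                // checksum never written: crash window
                }
            }
            CRC32 crc = new CRC32();
            crc.update(data, 0, data.length);
            int v = (int) crc.getValue();
            return Arrays.equals(stored, new byte[] {
                (byte) (v >>> 24), (byte) (v >>> 16), (byte) (v >>> 8), (byte) v
            });
        }
    }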

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

