hadoop-hdfs-issues mailing list archives

From "Konstantin Shvachko (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HDFS-1225) Block lost when primary crashes in recoverBlock
Date Wed, 23 Jun 2010 02:40:51 GMT

     [ https://issues.apache.org/jira/browse/HDFS-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Shvachko updated HDFS-1225:
--------------------------------------

    Affects Version/s: 0.20-append
                           (was: 0.20.1)

> Block lost when primary crashes in recoverBlock
> -----------------------------------------------
>
>                 Key: HDFS-1225
>                 URL: https://issues.apache.org/jira/browse/HDFS-1225
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: data-node
>    Affects Versions: 0.20-append
>            Reporter: Thanh Do
>
> - Summary: Block is lost if the primary datanode crashes in the middle of tryUpdateBlock.
>  
> - Setup:
> # available datanode = 2
> # replica = 2
> # disks / datanode = 1
> # failures = 1
> # failure type = crash
> # when/where failure happens = (see below)
>  
> - Details:
> Suppose we have 2 datanodes, dn1 and dn2, and dn1 is the primary.
> The client appends to blk_X_1001, and dn1 crashes during dn1.recoverBlock,
> right after blk_X_1001.meta is renamed to blk_X_1001.meta_tmp1002.
> Interestingly, in this case block X is eventually lost. Why?
> After dn1 crashes at the rename, what is left in dn1's current directory is:
> 1) blk_X
> 2) blk_X_1001.meta_tmp1002
> ==> this is an invalid block, because it has no meta file associated with it.
> dn2 (after dn1 crash) now contains:
> 1) blk_X
> 2) blk_X_1002.meta
> (note that the rename at dn2 has completed, because dn1 called dn2.updateBlock() before
> calling its own updateBlock())
> But namenode.commitBlockSynchronization is never called,
> because dn1 has crashed. Therefore, from the namenode's point of view, block X still has
> generation stamp (GS) 1001, which matches neither the invalid replica on dn1 nor the
> GS 1002 replica on dn2.
> Hence, the block is lost.
> This bug was found by our Failure Testing Service framework:
> http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
> For questions, please email us: Thanh Do (thanhdo@cs.wisc.edu) and 
> Haryadi Gunawi (haryadi@eecs.berkeley.edu)
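
The sequence described in the report can be replayed with a small toy model. This is only a sketch, not HDFS code: each datanode directory is modeled as a set of file names, and the helpers `update_block`, `recover_block`, and `usable_replicas` are hypothetical names introduced for illustration. It also assumes that a replica is usable only if it has both the data file and a `.meta` file whose GS matches the one the namenode holds, which is consistent with the report's conclusion but is not taken from the HDFS source.

```python
# Toy model of the scenario in the report; this is not HDFS code.
# Each datanode's "current" directory is modeled as a set of file names.

def update_block(files, block_id, old_gs, new_gs, crash_after_tmp_rename=False):
    """Bump the generation stamp (GS) of a replica's meta file in two renames.

    Returns True if the update completed, False if the node "crashed"
    right after the first rename (the failure point in the report).
    """
    old_meta = f"blk_{block_id}_{old_gs}.meta"
    tmp_meta = f"{old_meta}_tmp{new_gs}"
    new_meta = f"blk_{block_id}_{new_gs}.meta"

    files.remove(old_meta)
    files.add(tmp_meta)
    if crash_after_tmp_rename:
        return False
    files.remove(tmp_meta)
    files.add(new_meta)
    return True


def recover_block(crash_primary):
    """Replay recovery with dn1 as primary; return (directories, namenode GS)."""
    dn1 = {"blk_X", "blk_X_1001.meta"}
    dn2 = {"blk_X", "blk_X_1001.meta"}
    namenode_gs = 1001

    # The primary updates the other replica before its own, as the report notes.
    update_block(dn2, "X", 1001, 1002)
    if update_block(dn1, "X", 1001, 1002, crash_after_tmp_rename=crash_primary):
        # Stands in for commitBlockSynchronization, reached only without a crash.
        namenode_gs = 1002
    return {"dn1": dn1, "dn2": dn2}, namenode_gs


def usable_replicas(nodes, block_id, namenode_gs):
    """Assumed rule: data file plus a .meta file with the namenode's GS."""
    wanted = {f"blk_{block_id}", f"blk_{block_id}_{namenode_gs}.meta"}
    return [name for name, files in nodes.items() if wanted <= files]


nodes, gs = recover_block(crash_primary=True)
print(usable_replicas(nodes, "X", gs))   # []

nodes, gs = recover_block(crash_primary=False)
print(usable_replicas(nodes, "X", gs))   # ['dn1', 'dn2']
```

With the crash, dn1 is left with blk_X and blk_X_1001.meta_tmp1002, dn2 holds blk_X_1002.meta, and the namenode still expects GS 1001, so under the assumed rule no replica qualifies.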

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

