hadoop-hdfs-dev mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] Resolved: (HDFS-1227) UpdateBlock fails due to unmatched file length
Date Wed, 23 Jun 2010 04:53:51 GMT

     [ https://issues.apache.org/jira/browse/HDFS-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon resolved HDFS-1227.
-------------------------------

    Resolution: Duplicate

Going to resolve this as invalid. If you can reproduce after HDFS-1186 is committed, or provide
a unit test, we can reopen.

> UpdateBlock fails due to unmatched file length
> ----------------------------------------------
>
>                 Key: HDFS-1227
>                 URL: https://issues.apache.org/jira/browse/HDFS-1227
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: data-node
>    Affects Versions: 0.20-append
>            Reporter: Thanh Do
>
> - Summary: client append is not atomic. Hence it is possible that, when
> retrying during an append, updateBlock throws an exception indicating an
> unmatched file length, which makes the append fail.
>  
> - Setup:
> + # available datanodes = 3
> + # disks / datanode = 1
> + # failures = 2
> + failure type = bad disk
> + When/where failure happens = (see below)
> + This bug is non-deterministic. To reproduce it, add a sufficiently long
> sleep before out.write() in BlockReceiver.receivePacket() on dn1 and dn2,
> but not on dn3 (see the sketch just below this list).
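> A minimal sketch of that injection, assuming the 0.20-append layout of
> BlockReceiver.receivePacket() (the flag, the sleep length, and the write
> arguments are illustrative, not the actual Hadoop code):
>
>     // inside BlockReceiver.receivePacket(), just before the data hits disk
>     if (delayThisDatanode) {   // hypothetical flag: true on dn1 and dn2 only
>       try {
>         // widen the window so the packet reaches disk only after
>         // getMetaDataInfo() has already been called during recovery
>         Thread.sleep(5000);
>       } catch (InterruptedException ie) {
>         Thread.currentThread().interrupt();
>       }
>     }
>     out.write(pktBuf, dataOff, len);  // the delayed write that later grows
>                                       // the block file from 16 to 32 bytes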
>  
> - Details:
> Suppose the client appends 16 bytes to block X, which has a length of 16
> bytes at dn1, dn2, and dn3. dn1 is the primary. The pipeline is dn3-dn2-dn1.
> recoverBlock succeeds. The client starts sending data to dn3, the first
> datanode in the pipeline. dn3 forwards the packet to the downstream
> datanodes and starts writing the data to its disk. Suppose dn3 hits an
> exception while writing to disk. The client gets the exception and starts
> the recovery code by calling dn1.recoverBlock() again.
> dn1 in turn calls dn2.getMetaDataInfo() and dn1.getMetaDataInfo() to build
> the syncList. Suppose that at the time getMetaDataInfo() is called on both
> datanodes (dn1 and dn2), the previous packet (sent from dn3) has not
> reached disk yet. Hence, the block info returned by getMetaDataInfo()
> reports a length of 16 bytes. But after that, the packet reaches disk, and
> the block file length becomes 32 bytes.
> Using the syncList (which contains block info with a length of 16 bytes),
> dn1 calls updateBlock on dn2 and dn1, which fails because the length in the
> new block info (16 bytes, as passed to updateBlock) does not match the
> actual length on disk (32 bytes); see the sketch below.
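> A minimal, self-contained sketch of the check that then trips, modeled on
> the 0.20-append FSDataset-style length validation (class and variable names
> are illustrative, not the actual Hadoop code):
>
>     import java.io.File;
>     import java.io.IOException;
>
>     class UpdateBlockSketch {
>       // syncListLen: the stale length recorded in the syncList (16 bytes).
>       // blockFile:   the on-disk block file, already grown to 32 bytes by
>       //              the packet that landed after the syncList was built.
>       static void updateBlock(long syncListLen, File blockFile)
>           throws IOException {
>         long onDiskLen = blockFile.length();
>         if (syncListLen != onDiskLen) {   // 16 != 32, so recovery aborts
>           throw new IOException("Block length " + syncListLen
>               + " does not match block file length " + onDiskLen);
>         }
>         // otherwise: truncate/update metadata and complete the recovery
>       }
>     }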
>  
> Note that this bug is non-deterministic. It depends on the thread
> interleaving at the datanodes.
> This bug was found by our Failure Testing Service framework:
> http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
> For questions, please email us: Thanh Do (thanhdo@cs.wisc.edu) and 
> Haryadi Gunawi (haryadi@eecs.berkeley.edu)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

