hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (Created) (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HDFS-2742) HA: observed dataloss in replication stress test
Date Tue, 03 Jan 2012 04:52:21 GMT
HA: observed dataloss in replication stress test

                 Key: HDFS-2742
                 URL: https://issues.apache.org/jira/browse/HDFS-2742
             Project: Hadoop HDFS
          Issue Type: Sub-task
          Components: data-node, ha, name-node
    Affects Versions: HA branch (HDFS-1623)
            Reporter: Todd Lipcon
            Assignee: Todd Lipcon
            Priority: Blocker
         Attachments: log-colorized.txt

The replication stress test failed over the weekend because one of the replicas went missing.
I'm still diagnosing the issue, but the chain of events seems to have been something like:
- A block report was generated on one of the nodes while the block was still being written, so the report listed the replica as RBW.
- The standby replayed this queued message only after the file had been marked complete, so it marked the replica as corrupt.
- The standby asked the DN holding the "corrupt" replica to delete it and, I think, removed the replica from the block map at that point.
- That DN then sent another block report before it received the deletion request. Because the replica was FINALIZED by then, it was re-added to the block map.
- Replication was then lowered on the file. The replica above was counted as non-corrupt, and the other replicas were scheduled for deletion.
- All replicas were lost.
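
To make the suspected ordering easier to follow, here is a minimal toy model of the sequence above. It is not HDFS code: ToyNameNode, process_report, set_replication, the DN names, and the replication values 3 and 1 are all hypothetical, and the rule for picking which replicas to drop is made deterministic only so that the sketch reproduces the bad case described in the list.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNameNode:
    """Toy block-map bookkeeping for a single block (hypothetical, not HDFS)."""
    file_complete: bool = False
    replication: int = 3                               # assumed starting value
    live: set = field(default_factory=set)             # DNs counted as good replicas
    pending_delete: set = field(default_factory=set)   # DNs asked to delete

    def process_report(self, dn, state):
        if state == "RBW" and self.file_complete:
            # A stale RBW report replayed after completion looks corrupt:
            # drop the replica from the map and schedule its deletion.
            self.live.discard(dn)
            self.pending_delete.add(dn)
        elif state == "FINALIZED":
            # The replica is (re-)added without checking pending deletions.
            self.live.add(dn)

    def set_replication(self, n):
        self.replication = n
        # Assumption for the sketch: keep the first n DNs by name.
        for dn in sorted(self.live)[n:]:
            self.live.discard(dn)
            self.pending_delete.add(dn)


def simulate():
    nn = ToyNameNode()
    queued = ("dn1", "RBW")                 # report generated mid-write, queued
    nn.process_report("dn2", "FINALIZED")
    nn.process_report("dn3", "FINALIZED")
    nn.file_complete = True
    nn.process_report(*queued)              # replayed late: dn1 marked corrupt
    nn.process_report("dn1", "FINALIZED")   # second report beats the deletion
    nn.set_replication(1)                   # dn1 kept, dn2 and dn3 dropped
    # Replicas that survive once every pending deletion has been carried out.
    on_disk = {"dn1", "dn2", "dn3"} - nn.pending_delete
    return nn, on_disk
```

In this sketch, simulate() ends with the block map still counting dn1 as a good replica while every DN has a deletion pending, so no replica survives once the deletions are carried out. Whether the real code path matches this model is exactly what is still being diagnosed.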


