hadoop-hdfs-issues mailing list archives

From "Walter Su (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-9275) Fix TestRecoverStripedFile
Date Thu, 22 Oct 2015 15:54:27 GMT

    [ https://issues.apache.org/jira/browse/HDFS-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14969341#comment-14969341
] 

Walter Su commented on HDFS-9275:
---------------------------------

I kept digging, and now I understand the whole sequence of events:

# While the client is writing blockGroup_0, DN1 sends a heartbeat with xceiverCount=3.
# The client finishes writing blockGroup_0 and blockGroup_1.
# DN8~10 are shut down, so idx_6~8 of blockGroup_1 are missing.
# ReplicationMonitor schedules the 1st recovery for blockGroup_1. Because DN1 is considered busy (see previous comments), BlockPlacementPolicy chooses DN0 and DN11 as targets.
# ErasureCodingWorker recovers idx_6 at DN0 and idx_7 at DN11. (See getTargetIndices() to understand why.)
# Before idx_6 and idx_7 are reported, ReplicationMonitor schedules a 2nd recovery for blockGroup_1. It chooses DN0 as the target.
# ErasureCodingWorker tries to recover idx_6 at DN0 and fails, because DN0 complains that the replica already exists.
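
The sequence above can be reproduced with a toy model. This is only a sketch, not Hadoop code: the class EcRecoveryCollision, its methods, and the assumption of 9 internal blocks per group (6 data + 3 parity) are hypothetical and chosen for illustration. The worker here guesses the indices to rebuild purely from the set of indices reported as live, which is the behaviour described in the steps.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of two overlapping EC recovery tasks. Hypothetical names, not Hadoop code. */
class EcRecoveryCollision {
    // Replicas physically present on each DataNode: node name -> internal block indices.
    static final Map<String, Set<Integer>> onDisk = new HashMap<>();

    /** Guess the indices to rebuild from the reported-live indices only. */
    static List<Integer> guessTargetIndices(Set<Integer> reportedLive, int totalBlocks, int targetCount) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < totalBlocks && result.size() < targetCount; i++) {
            if (!reportedLive.contains(i)) {
                result.add(i);
            }
        }
        return result;
    }

    /** Run one recovery task. Returns false if a target already holds the replica. */
    static boolean recover(Set<Integer> reportedLive, int totalBlocks, List<String> targets) {
        List<Integer> indices = guessTargetIndices(reportedLive, totalBlocks, targets.size());
        boolean ok = true;
        for (int k = 0; k < indices.size(); k++) {
            Set<Integer> replicas = onDisk.computeIfAbsent(targets.get(k), n -> new HashSet<>());
            if (!replicas.add(indices.get(k))) {
                ok = false; // the DataNode would complain that the replica exists
            }
        }
        return ok;
    }

    /** Replays steps 4 to 7: two tasks scheduled before any new replica is reported. */
    static boolean[] simulate() {
        onDisk.clear();
        // idx_6~8 are missing; the reported view does not change between the two tasks.
        Set<Integer> reportedLive = new TreeSet<>(Arrays.asList(0, 1, 2, 3, 4, 5));
        boolean first = recover(reportedLive, 9, Arrays.asList("DN0", "DN11"));
        boolean second = recover(reportedLive, 9, Arrays.asList("DN0"));
        return new boolean[] {first, second};
    }

    public static void main(String[] args) {
        boolean[] r = simulate();
        System.out.println("first recovery ok:  " + r[0]);
        System.out.println("second recovery ok: " + r[1]);
    }
}
```

In this model the first task succeeds and the second one fails, because both tasks derive idx_6 from the same stale view and both place it on DN0.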

A delayed heartbeat is the direct cause of the failed tests. The deeper cause is not the test code but defects in how 2 concurrent EC recovery tasks are handled:
# Defect in ReplicationMonitor: it shouldn't choose the same DataNode as a target twice for the same block.
# Defect in ErasureCodingWorker: it doesn't know which internal blocks are being recovered or have already been recovered. It purely guesses from the live nodes.
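
One possible way to address both defects is to remember which internal blocks of a block group are already being recovered and on which DataNode. The following is a minimal sketch of that idea only. It is not the attached patch, and PendingEcRecovery, tryAdd() and reported() are hypothetical names that do not exist in Hadoop.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical bookkeeping for in-flight EC recoveries. Not Hadoop code. */
class PendingEcRecovery {
    // block group id -> (internal block index -> target DataNode), for recoveries not yet reported
    private final Map<String, Map<Integer, String>> pending = new HashMap<>();

    /**
     * Register a recovery of the given index on the given target.
     * Returns false if the index is already in flight, or if the target
     * is already used by another in-flight recovery of the same group.
     */
    synchronized boolean tryAdd(String blockGroup, int index, String target) {
        Map<Integer, String> inFlight = pending.computeIfAbsent(blockGroup, g -> new HashMap<>());
        if (inFlight.containsKey(index) || inFlight.containsValue(target)) {
            return false;
        }
        inFlight.put(index, target);
        return true;
    }

    /** Forget an index once the rebuilt internal block has been reported. */
    synchronized void reported(String blockGroup, int index) {
        Map<Integer, String> inFlight = pending.get(blockGroup);
        if (inFlight != null) {
            inFlight.remove(index);
            if (inFlight.isEmpty()) {
                pending.remove(blockGroup);
            }
        }
    }
}
```

With such a table, the 2nd recovery in the scenario above would be rejected for both idx_6 (already in flight) and DN0 (already a target for blockGroup_1). Once a replica is reported, the sketch assumes the usual placement rules are what keep its DataNode from being picked again.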

> Fix TestRecoverStripedFile
> --------------------------
>
>                 Key: HDFS-9275
>                 URL: https://issues.apache.org/jira/browse/HDFS-9275
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: test
>            Reporter: Walter Su
>            Assignee: Walter Su
>         Attachments: HDFS-9275.01.patch, HDFS-9275.02.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
