hadoop-hdfs-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3982) report failed replications in DN heartbeat
Date Thu, 27 Sep 2012 17:58:07 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13464924#comment-13464924 ]

Colin Patrick McCabe commented on HDFS-3982:

In step #4, why doesn't the DN receiving the corrupt replica simply accept it and flag it
as corrupt?  The block would then leave pendingReplications until the NN notices that it
needs to be re-replicated because 2 of its 3 replicas are corrupt.  No need for any special
flags or fields?

If it takes us a long time to re-replicate blocks that have only 1 non-corrupt replica, that
seems like a separate problem that we should fix rather than hack around, unless I'm missing
something.

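
The alternative suggested in this comment can be sketched roughly as follows. This is a minimal, hypothetical model and not HDFS code: the class `ReceiveAndFlagSketch`, its methods, and the DN names are all invented for illustration, and it assumes the NN schedules new work only when live replicas plus in-flight copies fall below the expected count.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of one block's replica state on the NN.
// None of these names are real HDFS classes or methods.
class ReceiveAndFlagSketch {
    private final Set<String> pendingTargets = new HashSet<>();  // DNs with a copy in flight
    private final Set<String> liveReplicas = new HashSet<>();    // replicas believed good
    private final Set<String> corruptReplicas = new HashSet<>(); // replicas known corrupt

    void addLiveReplica(String dn) {
        liveReplicas.add(dn);
    }

    // A client or DN reports an existing replica as bad.
    void markCorrupt(String dn) {
        liveReplicas.remove(dn);
        corruptReplicas.add(dn);
    }

    // NN asks some source DN to copy the block to targetDn.
    void scheduleReplication(String targetDn) {
        pendingTargets.add(targetDn);
    }

    // The target reports the block as received. The transfer is finished
    // either way, so the pending entry is cleared; the corrupt flag only
    // decides which set the new replica lands in.
    void blockReceived(String targetDn, boolean corrupt) {
        pendingTargets.remove(targetDn);
        if (corrupt) {
            corruptReplicas.add(targetDn);
        } else {
            liveReplicas.add(targetDn);
        }
    }

    // Assumed scheduling rule: live + in-flight copies below the target.
    boolean needsReplication(int expectedReplicas) {
        return liveReplicas.size() + pendingTargets.size() < expectedReplicas;
    }

    public static void main(String[] args) {
        ReceiveAndFlagSketch block = new ReceiveAndFlagSketch();
        block.addLiveReplica("dn1");
        block.addLiveReplica("dn2");
        block.addLiveReplica("dn3");
        block.markCorrupt("dn2");           // client reports a bad block
        block.scheduleReplication("dn4");   // source happens to be corrupt too
        System.out.println(block.needsReplication(3));
        block.blockReceived("dn4", true);   // target stores it, flags it corrupt
        System.out.println(block.needsReplication(3));
    }
}
```

In this model no extra heartbeat field is needed: the ordinary "received" report clears the pending entry, and the corrupt flag makes the block under-replicated again.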
> report failed replications in DN heartbeat
> ------------------------------------------
>                 Key: HDFS-3982
>                 URL: https://issues.apache.org/jira/browse/HDFS-3982
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: data-node
>    Affects Versions: 2.0.2-alpha
>            Reporter: Andy Isaacson
>            Assignee: Andy Isaacson
>            Priority: Minor
> From HDFS-3931:
> {quote}
> # The test corrupts 2/3 replicas.
> # client reports a bad block.
> # NN asks a DN to re-replicate, and randomly picks the other corrupt replica.
> # DN notices the incoming replica is corrupt and reports it as a bad block, but does
> not inform the NN that re-replication failed.
> # NN keeps the block on pendingReplications.
> # BP scanner wakes up on both DNs with corrupt blocks, both report corruption. NN reports
> both as duplicates, one from the client and one from the DN report above.
> # Since block is on pendingReplications, NN does not schedule another replication.
> Todd wrote:
> I can think of a few ways to fix this:
> ...
>  2) Add a field to the DN heartbeat which reports back a failed replication for a given
> block. The NN would use this to decrement the pendingReplication count, which would cause
> a new replication attempt to be made if it was still under-replicated.
> This jira tracks implementing the DN heartbeat replication failure report.
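
The bookkeeping described in option 2 can be sketched roughly as follows. This is a simplified, hypothetical model rather than the actual NN implementation: `PendingReplicationSketch` and its method names are invented here, and the rule "schedule only when live + pending < expected" is an assumption made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of the NN's pending-replication bookkeeping.
// None of these names are real HDFS classes or methods.
class PendingReplicationSketch {
    // blockId -> number of replication attempts currently in flight
    private final Map<Long, Integer> pending = new HashMap<>();

    // NN asks a DN to copy a block: one more attempt is in flight.
    void scheduleReplication(long blockId) {
        pending.merge(blockId, 1, Integer::sum);
    }

    // A DN heartbeat carries a failed-replication report for blockId:
    // decrement the in-flight count and drop the entry when it reaches zero.
    // Reports for blocks with nothing pending are ignored.
    void reportFailedReplication(long blockId) {
        pending.computeIfPresent(blockId, (id, n) -> n > 1 ? n - 1 : null);
    }

    int pendingCount(long blockId) {
        return pending.getOrDefault(blockId, 0);
    }

    // Assumed scheduling rule: in-flight attempts count toward the target,
    // so a stale pending entry suppresses a new attempt.
    boolean needsReplication(long blockId, int liveReplicas, int expectedReplicas) {
        return liveReplicas + pendingCount(blockId) < expectedReplicas;
    }

    public static void main(String[] args) {
        PendingReplicationSketch nn = new PendingReplicationSketch();
        long blockId = 42L;  // arbitrary illustrative id
        nn.scheduleReplication(blockId);
        // NN believes 2 replicas are good and 1 copy is in flight.
        System.out.println(nn.needsReplication(blockId, 2, 3));
        nn.reportFailedReplication(blockId);
        // With the failure reported, the block counts as under-replicated again.
        System.out.println(nn.needsReplication(blockId, 2, 3));
    }
}
```

Without the failure report, the pending entry in this model stays in place and keeps suppressing a new attempt, which is the behavior described in the steps quoted above.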

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira