hadoop-common-dev mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1557) Deletion of excess replicas should prefer to delete corrupted replicas before deleting valid replicas
Date Tue, 03 Jul 2007 17:54:04 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12509957 ]

Doug Cutting commented on HADOOP-1557:

> the bug I was pointing to occurs when setReplication() is called to decrease the number
> of replicas

Sorry it's taken me so long to understand this!  Yes, I see the issue now.  I'm not sure I
yet have great sympathy for it.  In general, as one decreases the number of replicas, the
chance that all of them are corrupt increases.  After HADOOP-1134 we should see corruptions
primarily due to disk errors.  A disk can start failing at any time.  Validating
some replicas before others are removed would somewhat reduce the chance that all replicas
are corrupt, but not dramatically, so I'm not convinced it's worth the expense.

Disk errors are not entirely random.  When we see a single error from a disk, we're likely
to see more from that disk.  So keeping statistics of the number of corruptions identified
per datanode would be very valuable.  And automatically taking datanodes offline when corruptions
exceed some threshold might go farther towards addressing this issue than explicitly checking
blocks as replication thresholds are reduced, since this would remove replicas from failing
drives *before* they're read, replicated, de-replicated, etc.
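
To make the per-datanode statistics idea concrete, here is a minimal sketch of a counter that flags a datanode once its reported corruptions exceed a threshold. The class name CorruptionTracker, its methods, and the string datanode ids are all hypothetical and are not part of Hadoop; how a flagged node would actually be taken offline is left out.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch (not a Hadoop class): counts corruptions reported per datanode. */
class CorruptionTracker {
    private final Map<String, Integer> corruptionsByNode = new HashMap<>();
    private final int threshold;

    /** @param threshold number of corruptions a node may report before it is flagged */
    CorruptionTracker(int threshold) {
        this.threshold = threshold;
    }

    /**
     * Records one corrupt block found on the given datanode.
     * Returns true once the node's count exceeds the threshold, which a
     * caller could treat as the signal to take the node offline.
     */
    synchronized boolean reportCorruption(String datanodeId) {
        int count = corruptionsByNode.merge(datanodeId, 1, Integer::sum);
        return count > threshold;
    }

    /** Returns the number of corruptions recorded so far for the datanode. */
    synchronized int corruptionCount(String datanodeId) {
        return corruptionsByNode.getOrDefault(datanodeId, 0);
    }
}
```

With a threshold of 2, the first two reports for a node return false and the third returns true. A real implementation would probably also age out old counts, so that a long-lived node is not retired for a handful of errors spread over years; that is omitted here.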

> Deletion of excess replicas should prefer to delete corrupted replicas before deleting
> valid replicas
> -----------------------------------------------------------------------------------------------------
>                 Key: HADOOP-1557
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1557
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>            Reporter: dhruba borthakur
> Suppose a block has three replicas and two of the replicas are corrupted. If the replication
> factor of the file is reduced to 2, the filesystem should preferably delete the two corrupted
> replicas; otherwise it could lead to a corrupted file.
> One option would be to make the datanode periodically validate all blocks with their
> corresponding CRCs. The other option would be to make the setReplication call validate existing
> replicas before deleting excess replicas.
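
The second option in the description (validate replicas before deleting the excess) could look roughly like the sketch below. It is a simplification under stated assumptions: each replica is checked against a single expected CRC32 for the whole block, and the names ExcessReplicaChooser, Replica, crcOf, and chooseForDeletion are hypothetical, not Hadoop's API. It only picks which excess replicas to drop; it does not re-replicate or remove corrupt replicas beyond the excess count.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

/** Hypothetical sketch (not Hadoop code): prefer corrupt replicas when trimming excess. */
class ExcessReplicaChooser {

    /** Hypothetical replica record: the datanode holding it and the bytes read back. */
    static final class Replica {
        final String datanodeId;
        final byte[] data;

        Replica(String datanodeId, byte[] data) {
            this.datanodeId = datanodeId;
            this.data = data;
        }
    }

    /** Computes the CRC32 of a replica's bytes. */
    static long crcOf(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    /**
     * Returns the replicas to delete so that targetReplication remain.
     * Replicas whose CRC does not match expectedCrc are chosen first;
     * valid replicas are chosen only if there are not enough corrupt ones.
     */
    static List<Replica> chooseForDeletion(List<Replica> replicas,
                                           long expectedCrc,
                                           int targetReplication) {
        int excess = Math.max(0, replicas.size() - targetReplication);
        List<Replica> corrupt = new ArrayList<>();
        List<Replica> valid = new ArrayList<>();
        for (Replica r : replicas) {
            if (crcOf(r.data) == expectedCrc) {
                valid.add(r);
            } else {
                corrupt.add(r);
            }
        }
        // Order candidates so corrupt replicas are deleted before valid ones.
        List<Replica> ordered = new ArrayList<>(corrupt);
        ordered.addAll(valid);
        return new ArrayList<>(ordered.subList(0, excess));
    }
}
```

For the scenario in the description (three replicas, two corrupt, target replication 2), this picks one of the corrupt replicas and never the valid one. The cost is that every replica must be read and checksummed at setReplication time, which is the expense weighed in the comment above.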

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
