hadoop-hdfs-issues mailing list archives

From "Li Bo (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-9826) Erasure Coding: Postpone the recovery work for a configurable time period
Date Wed, 09 Mar 2016 04:00:42 GMT

     [ https://issues.apache.org/jira/browse/HDFS-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Li Bo updated HDFS-9826:
    Attachment: HDFS-9826-002.patch

>  Erasure Coding: Postpone the recovery work for a configurable time period
> --------------------------------------------------------------------------
>                 Key: HDFS-9826
>                 URL: https://issues.apache.org/jira/browse/HDFS-9826
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Li Bo
>            Assignee: Li Bo
>         Attachments: HDFS-9826-001.patch, HDFS-9826-002.patch
> Currently the NameNode schedules recovery as soon as it finds an under-replicated block group.
This is inefficient and takes resources away from other operations. When only one internal block
is corrupted, it would be better to postpone the recovery work for a period of time, considering
the following points from papers such as [1][2]:
> 1. Transient errors in which no data are lost account for more than 90% of data center
failures, owing to network partitions, software problems, or non-disk hardware faults.
> 2. Although erasure codes tolerate multiple simultaneous failures, single failures represent
99.75% of recoveries.
> Clusters differ in their failure characteristics, so users should be able to configure how
long recovery is postponed. A suitable setting will avoid a large proportion of unnecessary
recoveries. When multiple internal blocks in a block group are found to be corrupted, we
schedule the recovery work immediately, because this case is very rare and we don’t want to
increase the risk of losing data.
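
The decision rule described above can be sketched roughly as follows. This is only an illustration, not code from the attached patches: the class name `RecoveryPostponePolicy`, the method `shouldRecoverNow`, and its parameters are hypothetical, and the postpone period is passed to the constructor rather than read from any real HDFS configuration key.

```java
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch of the proposed policy; names are illustrative
 * and do not come from the HDFS-9826 patches.
 */
final class RecoveryPostponePolicy {
    // How long a single-block failure is left alone before recovery is scheduled.
    private final long postponeMillis;

    RecoveryPostponePolicy(long postponeDuration, TimeUnit unit) {
        if (postponeDuration < 0) {
            throw new IllegalArgumentException("postponeDuration must be >= 0");
        }
        this.postponeMillis = unit.toMillis(postponeDuration);
    }

    /**
     * @param missingInternalBlocks number of corrupted or missing internal blocks in the block group
     * @param firstDetectedMillis   time the failure was first observed
     * @param nowMillis             current time
     * @return true if recovery should be scheduled now
     */
    boolean shouldRecoverNow(int missingInternalBlocks,
                             long firstDetectedMillis,
                             long nowMillis) {
        if (missingInternalBlocks <= 0) {
            return false; // nothing to recover
        }
        if (missingInternalBlocks > 1) {
            return true;  // multiple failures: recover immediately
        }
        // Single failure: wait out the postpone period in case the error is transient.
        return nowMillis - firstDetectedMillis >= postponeMillis;
    }
}
```

With a postpone period of 10 minutes, a block group with one missing internal block would be skipped until 10 minutes after detection, while a group with two or more missing blocks would be scheduled right away.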
> [1] Availability in globally distributed storage systems
> http://static.usenix.org/events/osdi10/tech/full_papers/Ford.pdf
> [2] Rethinking erasure codes for cloud file systems: minimizing I/O for recovery and
degraded reads
> http://static.usenix.org/events/fast/tech/full_papers/Khan.pdf

This message was sent by Atlassian JIRA