hadoop-hdfs-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Steve Loughran (Created) (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HDFS-2472) Extend UnderReplicatedBlocks queue to give blocks whose existing copies are all on a single rack priority over multi-rack blocks
Date Wed, 19 Oct 2011 16:27:10 GMT
Extend UnderReplicatedBlocks queue to give blocks whose existing copies are all on a single
rack priority over multi-rack blocks
--------------------------------------------------------------------------------------------------------------------------------

                 Key: HDFS-2472
                 URL: https://issues.apache.org/jira/browse/HDFS-2472
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: data-node
            Reporter: Steve Loughran
            Assignee: Steve Loughran
            Priority: Minor


Google's Availability in Globally Distributed Storage Systems paper [http://research.google.com/pubs/archive/36737.pdf]
 shows that storage node failures are often correlated with other failures on the same rack.
This could be due to rack-local failures: switches, power, etc, or operations actions (take
down a whole rack for a rolling OS upgrade). Whatever the root cause, the paper argues (section
5.2) that rack-aware placement would increase the MTTF of a single block by a factor of three
(typically). Some decisions can be made a block placement time, but that would be a separate
issue. 

Here I propose giving priority to blocks that are under replicated and where all blocks are
on the same rack above those blocks that are under-replicated and the remaining blocks are
on different racks.

# Provided the failure does not take down the entire rack in one go (e.g. switch failure),
this policy would decrease the time in which all blocks would be on a single rack, so reduce
the consequences of a rack failure. 
# On a single-rack system, all under-replicated blocks would go into this queue, so the state
would effectively be that of today's system.


This may make the demand for off-rack bandwidth ramp up immediately, because priority will
be given to blocks that must be replicated off rack. I am not sure that it will, however,
as multi-rack replication policies would generate the same amount of traffic to re-replicate
the blocks anyway. 

The main barrier to implementing this feature is that currently the UnderReplicatedBlocks
queues do not get provided with any information about block locality. We'd need to change
the add() method to take information about where the current replicas are, and the BlockManager
would have to provide some information as to whether the block was rack-local, not-rack local
or unknown; the latter because the PendingReplicationBlocks structure does not know where
things come from. "unknown" items would have to be given the same priority as rack-only blocks
(pessimistic) or the same priority as multi-rack blocks (optimistic). I would be biased towards
the pessimistic approach as it would ensure that on single-rack systems there would be no
obvious change in behaviour. On a multi-rack system it would give priority to timed out PendingReplication
blocks ahead of multi-rack under-replicated blocks. That may be a good thing in itself


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message