hadoop-hdfs-issues mailing list archives

From "Derek Dagit (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-4366) Block Replication Policy Implementation May Skip Higher-Priority Blocks for Lower-Priority Blocks
Date Tue, 08 Jan 2013 23:22:12 GMT

     [ https://issues.apache.org/jira/browse/HDFS-4366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Derek Dagit updated HDFS-4366:
------------------------------

    Attachment: hdfs-4366-unittest.patch

My initial thought is to encapsulate the priority indices inside UnderReplicatedBlocks, which
is where the priority queue and indices live anyway.

We could also guarantee the appropriate index is decremented properly on each call to remove.

I do not think we can know in most cases whether a particular block lies to the left or right
of the index since the random look-up of blocks is implemented as a hash, whereas the index
is an index into a doubly-linked list.  We would have to walk from the head or tail of the
doubly-linked list to find the answer.

Also, decrementing when we do not have to is not dangerous, since at worst it means we re-process
a block that we would not have had to otherwise.  But we should also make sure to clamp the
index at 0 to avoid unnecessary processing.  Currently with the patch, the index can go negative.
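
A minimal sketch of the proposal above, for discussion only. It models a single priority level and is not the actual UnderReplicatedBlocks code: the class name PriorityLevelSketch, its methods, and the use of java.util.LinkedHashSet (standing in for a hash look-up over a doubly-linked list) are all assumptions made for illustration. The index lives inside the class, every successful remove steps it back by one, and it is clamped at 0.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;

/** Hypothetical, simplified stand-in for one priority level of the queue. */
final class PriorityLevelSketch {
    // Insertion-ordered set: hash look-up for remove, linked order for iteration.
    private final LinkedHashSet<Long> blocks = new LinkedHashSet<>();
    // Position of the next block to hand out; owned by this class, not the caller.
    private int index = 0;

    boolean add(long blockId) {
        return blocks.add(blockId);
    }

    /** Removes a block and conservatively moves the index back by one, clamped at 0. */
    boolean remove(long blockId) {
        boolean removed = blocks.remove(blockId);
        if (removed) {
            // We cannot cheaply tell whether the block was before the index, so
            // always step back; the worst case is re-processing one block.
            index = Math.max(0, index - 1);
        }
        return removed;
    }

    /** Returns up to max blocks starting at the index, then advances the index. */
    List<Long> chooseNext(int max) {
        List<Long> chosen = new ArrayList<>();
        Iterator<Long> it = blocks.iterator();
        for (int i = 0; i < index && it.hasNext(); i++) {
            it.next(); // skip blocks already handed out in this pass
        }
        while (chosen.size() < max && it.hasNext()) {
            chosen.add(it.next());
        }
        index += chosen.size();
        if (index >= blocks.size()) {
            index = 0; // end of list reached: start over on the next call
        }
        return chosen;
    }

    int getIndex() {
        return index;
    }
}
```

With blocks 1 through 5 queued, choosing two blocks and then removing block 1 leaves the index at 1, so the next call still starts at block 3 and nothing is skipped. Removing a block to the right of the index also steps the index back, which costs at most one re-processed block.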

Comments welcome

                
> Block Replication Policy Implementation May Skip Higher-Priority Blocks for Lower-Priority
Blocks
> -------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-4366
>                 URL: https://issues.apache.org/jira/browse/HDFS-4366
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 1.1.1, 3.0.0, 0.23.5
>            Reporter: Derek Dagit
>            Assignee: Derek Dagit
>         Attachments: hdfs-4366-unittest.patch
>
>
> In certain cases, higher-priority under-replicated blocks can be skipped by the replication
policy implementation.  The current implementation maintains, for each priority level, an
index into a list of blocks that are under-replicated.  Together, the lists compose a priority
queue (see note later about branch-0.23).  In some cases when blocks are removed from a list,
the caller (BlockManager) properly handles the index into the list from which it removed a
block.  In some other cases, the index remains stationary while the list changes.  Whenever
this happens, and the removed block happened to be at or before the index, the implementation
will skip over a block when selecting blocks for replication work.
> In situations when entire racks are decommissioned, leading to many under-replicated
blocks, loss of blocks can occur.
> Background: HDFS-1765
> This patch to trunk greatly improved the state of the replication policy implementation.
 Prior to the patch, the following details were true:
> 	* The block "priority queue" was no such thing: It was really a set of trees that held
blocks in natural ordering, that being by the block's ID, which resulted in iterator walks
over the blocks in pseudo-random order.
> 	* There was only a single index into an iteration over all of the blocks...
> 	* ... meaning the implementation was only successful in respecting priority levels on
the first pass.  Overall, the behavior was a round-robin-type scheduling of blocks.
> After the patch
> 	* A proper priority queue is implemented, preserving log n operations while iterating
over blocks in the order added.
> 	* A separate index for each priority is kept...
> 	* ... allowing for processing of the highest priority blocks first regardless of which
priority had last been processed.
> The change was suggested for branch-0.23 as well as trunk, but it does not appear to
have been pulled in.
> The problem:
> Although the indices are now tracked in a better way, there is a synchronization issue
since the indices are managed outside of methods to modify the contents of the queue.
> Removal of a block from a priority level without adjusting the index can mean that the
index then points to the block after the block it originally pointed to.  In the next round
of scheduling for that priority level, the block originally pointed to by the index is skipped.
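
The skip described above can be reproduced with a toy model. This is only an illustration, not BlockManager code: the class SkippedBlockDemo, the block names, and the use of a plain ArrayList with a caller-held int index are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical demo of an index held outside the list it points into. */
final class SkippedBlockDemo {
    /** Returns the block the scheduler would pick next, or null at the end. */
    static String pickNext(List<String> level, int index) {
        return index < level.size() ? level.get(index) : null;
    }

    static String demo() {
        List<String> level = new ArrayList<>(Arrays.asList("A", "B", "C", "D"));
        int index = 2;      // A and B were already scheduled; C is next
        level.remove("A");  // removal before the index, index left untouched
        // The list is now [B, C, D], so index 2 points at D and C is skipped.
        return pickNext(level, index);
    }
}
```

Here demo() returns "D": block C is never scheduled in this pass because the index was not moved back when "A" was removed.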

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
