hadoop-hdfs-issues mailing list archives

From "Eli Collins (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HDFS-15) All replicas of a block end up on only 1 rack
Date Wed, 29 Dec 2010 23:31:48 GMT

     [ https://issues.apache.org/jira/browse/HDFS-15?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eli Collins updated HDFS-15:

    Attachment: hdfs-15-b20-1.patch

Here's a patch that applies against branch 20.  The original patch was written after the block
management refactoring, so it doesn't apply straightforwardly to 20; I've written a new patch,
but the fix is in the same spirit as the code in trunk.

The bug is hard to reproduce: a failure or decommission has to hit the only cross-rack replica
of a block during the window of time in which that block is over-replicated.

The patch adds the following new tests, which cover rack policy violations not covered by the
existing tests. Some of them fail when looped repeatedly w/o the fix (after commenting out the
asserts that check neededReplications, which will always fail). I'll forward-port these tests
to trunk in another jira. 

* Test that blocks that have a sufficient number of total replicas, but are not replicated
cross rack, get replicated cross rack when a rack becomes available.
* Test that new blocks for an underreplicated file will get replicated cross rack.
* Mark a block as corrupt and test that when it is re-replicated it is still replicated
across racks.
* Reduce the replication factor of a file, making sure that the only replica on a separate
rack is not removed when deleting replicas.
* Test that when a block is re-replicated because a replica is lost due to host failure, the
rack policy is preserved.
* Test that when the excess replicas of a block are reduced due to a node re-joining the cluster,
the rack policy is not violated.
* Test that rack policy is still respected when blocks are replicated due to node decommissioning.
* Test that rack policy is still respected when blocks are replicated due to node decommissioning,
even when the blocks are over-replicated.
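The invariant these tests exercise can be sketched roughly as follows. This is an illustrative sketch, not the actual patch code; the class and method names are made up, and replicas are modeled simply as the rack names that host them:

```java
// Hypothetical sketch of the rack placement invariant: a block with more
// than one replica should have its replicas span at least two racks.
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RackPolicyCheck {
    /** Returns true if the replicas' racks satisfy the placement policy. */
    static boolean satisfiesRackPolicy(List<String> replicaRacks) {
        if (replicaRacks.size() <= 1) {
            return true; // a single replica cannot span racks
        }
        Set<String> distinctRacks = new HashSet<>(replicaRacks);
        return distinctRacks.size() >= 2;
    }

    public static void main(String[] args) {
        // Three replicas on one rack: the replication count is sufficient,
        // but the rack policy is violated.
        System.out.println(satisfiesRackPolicy(Arrays.asList("/rack1", "/rack1", "/rack1")));
        System.out.println(satisfiesRackPolicy(Arrays.asList("/rack1", "/rack2", "/rack2")));
    }
}
```

Each test above sets up one path (corruption, decommission, host failure, excess-replica trimming) that previously could leave this invariant violated.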

> All replicas of a block end up on only 1 rack
> ---------------------------------------------
>                 Key: HDFS-15
>                 URL: https://issues.apache.org/jira/browse/HDFS-15
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 0.20.3
>            Reporter: Hairong Kuang
>            Assignee: Jitendra Nath Pandey
>            Priority: Critical
>             Fix For: 0.20.3, 0.21.0
>         Attachments: hdfs-15-b20-1.patch, HDFS-15.4.patch, HDFS-15.5.patch, HDFS-15.6.patch,
HDFS-15.patch, HDFS-15.patch.2, HDFS-15.patch.3
> HDFS replica placement strategy guarantees that the replicas of a block exist on at
least two racks when its replication factor is greater than one. But fsck still reports that
the replicas of some blocks end up on one rack.
> The cause of the problem is that decommission and corruption handling only check the
block's replication factor, not the rack requirement. When an over-replicated block loses
a replica due to decommission, corruption, or a lost heartbeat, the namenode does not take any
action to guarantee that the remaining replicas are on different racks.
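To make the described cause concrete: when an over-replicated block is trimmed, preserving the rack requirement amounts to preferring deletion of a replica whose rack still hosts another replica. The sketch below is illustrative only (hypothetical names, not HDFS internals), with replicas modeled as the rack names hosting them:

```java
// Hypothetical sketch: choose which excess replica of an over-replicated
// block to delete, preferring one from a rack that holds more than one
// replica so the survivors still span at least two racks.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ExcessReplicaChooser {
    /** Returns the index of the replica to delete. */
    static int chooseExcessReplica(List<String> replicaRacks) {
        Map<String, Integer> perRack = new HashMap<>();
        for (String rack : replicaRacks) {
            perRack.merge(rack, 1, Integer::sum);
        }
        // Prefer a replica on a rack that hosts multiple replicas.
        for (int i = 0; i < replicaRacks.size(); i++) {
            if (perRack.get(replicaRacks.get(i)) > 1) {
                return i;
            }
        }
        return 0; // every rack holds one replica; any deletion keeps diversity
    }
}
```

A chooser that considered only the replica count could pick the lone cross-rack replica, which is exactly the failure mode this issue describes.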

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
