hadoop-common-dev mailing list archives

From "Hairong Kuang (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (HADOOP-5465) Blocks remain under-replicated
Date Thu, 12 Mar 2009 21:38:50 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681107#action_12681107 ]

Hairong Kuang edited comment on HADOOP-5465 at 3/12/09 2:37 PM:
----------------------------------------------------------------

Two bugs in DFS contributed to the problem:
(1). DataNode does not synchronize modifications to the counter "xmitsInProgress", which keeps
track of the number of replications in progress. When two threads update the counter concurrently,
a race condition may occur. The counter may end up with a non-zero value when no replication
is going on.
(2). Each DN is configured to have at most 2 replications in progress. When DN notifies NN
that it has 1 replication in progress, NN should be able to send one block replication request
to DN. But NN wrongly interprets the counter as the number of targets. When it sees that the
block is scheduled to 2 targets but DN can only take 1, it sends an empty replication request
to DN. As a result, all replications from this DataNode are blocked. If the DataNode is the only
source of an under-replicated block, the block will never get replicated.
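
To illustrate (2), here is a minimal sketch of the two ways NN could account for a DN's free replication capacity. It is not the NameNode's actual code: the class, the method names and the PendingBlock type are all hypothetical, and only the numbers (a limit of 2, 1 replication in progress, a block with 2 targets) come from the description above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names come from the Hadoop source.
final class ReplicationSchedulingSketch {

    /** A block queued for replication from one DataNode to numTargets other nodes. */
    static final class PendingBlock {
        final String blockId;
        final int numTargets;

        PendingBlock(String blockId, int numTargets) {
            this.blockId = blockId;
            this.numTargets = numTargets;
        }
    }

    /**
     * Behaviour described in (2): free capacity is compared against the number
     * of targets, so a block with 2 targets is skipped when only 1 slot is free
     * and the request sent to the DataNode ends up empty.
     */
    static List<PendingBlock> pickCountingTargets(
            List<PendingBlock> queue, int maxXmits, int xmitsInProgress) {
        int free = maxXmits - xmitsInProgress;
        List<PendingBlock> picked = new ArrayList<>();
        for (PendingBlock b : queue) {
            if (b.numTargets > free) {
                break; // not enough "slots" for all targets of this block
            }
            picked.add(b);
            free -= b.numTargets;
        }
        return picked;
    }

    /**
     * Intended behaviour: each block is one replication, however many targets
     * it is sent to, so 1 free slot is enough for 1 block.
     */
    static List<PendingBlock> pickCountingBlocks(
            List<PendingBlock> queue, int maxXmits, int xmitsInProgress) {
        int free = maxXmits - xmitsInProgress;
        List<PendingBlock> picked = new ArrayList<>();
        for (PendingBlock b : queue) {
            if (free <= 0) {
                break; // the DataNode is already at its limit
            }
            picked.add(b);
            free--;
        }
        return picked;
    }
}
```

With a limit of 2, 1 replication in progress and a single queued block that has 2 targets, the first method returns an empty list and the second returns that block.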

Fixing either (1) or (2) could fix the problem. I think (1) is more fundamental so I will
fix (1) in this jira and file a different jira to fix (2).
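
For (1), one possible way to make the counter safe under concurrent updates is an atomic integer. This is only a sketch under that assumption and may differ from what the attached patch does; apart from "xmitsInProgress", every name below is made up for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; only the counter name "xmitsInProgress" comes from the issue.
final class XmitsCounterSketch {

    // Atomic read-modify-write, so concurrent ++ and -- cannot lose updates.
    private final AtomicInteger xmitsInProgress = new AtomicInteger();

    void transferStarted() {
        xmitsInProgress.incrementAndGet();
    }

    void transferFinished() {
        xmitsInProgress.decrementAndGet();
    }

    int getXmitsInProgress() {
        return xmitsInProgress.get();
    }

    /**
     * Runs several threads that each start and finish the given number of
     * transfers, then returns the final counter value. Every start is matched
     * by a finish, so the result should be 0 once all threads are done.
     */
    static int simulate(int threads, int transfersPerThread) {
        XmitsCounterSketch counter = new XmitsCounterSketch();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < transfersPerThread; j++) {
                    counter.transferStarted();
                    counter.transferFinished();
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return counter.getXmitsInProgress();
    }
}
```

With a plain int field and unsynchronized ++ and --, the same simulation can finish with a non-zero value, which is the symptom described in (1).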

> Blocks remain under-replicated
> ------------------------------
>
>                 Key: HADOOP-5465
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5465
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.18.3
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>            Priority: Blocker
>             Fix For: 0.18.4, 0.19.2, 0.20.0, 0.21.0
>
>         Attachments: xmitsSync1.patch
>
>
> Occasionally we see some blocks remain under-replicated in our production clusters.
> This is what we observed:
> 1. Sometimes when increasing the replication factor of a file, some blocks belonging to
> this file do not reach the new replication factor.
> 2. When taking a meta save on two different days, some blocks remain in the under-replication
> queue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

