hadoop-common-dev mailing list archives

From "Hairong Kuang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3050) Cluster fall into infinite loop trying to replicate a block to a target that aready has this replica.
Date Fri, 28 Mar 2008 21:40:24 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12583204#action_12583204 ]

Hairong Kuang commented on HADOOP-3050:
---------------------------------------

After examining the log, it looks like we hit the following scenario:
1. blk_167544198419718831 was replicated to datanode 1, datanode 2, and datanode 3.
2. Datanode 1 lost contact with the namenode, and datanode 2 was scheduled to be decommissioned.
3. Datanode 1 re-registered with the namenode, but its block report arrived before its network
location was resolved, so the block report was dropped.
4. Because the namenode did not know that datanode 1 had blk_167544198419718831, it scheduled
replication of the block to datanode 1 and datanode 4.
5. The replication to datanode 1 failed because datanode 1 already had the block.
6. No further block report was received before the end of the log, so the block replication
kept failing.
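
Steps 3 to 6 above can be sketched as a toy simulation. This is only an illustration
under simplifying assumptions, not Hadoop code: ToyNameNode, simulate, and the node names
dn1, dn3, and dn4 are hypothetical, and the real namenode's scheduling logic is not modeled.

```python
BLOCK = "blk_167544198419718831"


class ToyNameNode:
    """Hypothetical stand-in for the namenode's view of block locations."""

    def __init__(self):
        self.holders = {}      # block id -> set of nodes believed to hold it
        self.resolved = set()  # nodes whose network location is resolved

    def block_report(self, node, blocks):
        # Step 3: a report from a node with an unresolved location is dropped.
        if node not in self.resolved:
            return False
        for blk in blocks:
            self.holders.setdefault(blk, set()).add(node)
        return True

    def replication_targets(self, blk, candidates):
        # Step 4: any node not known to hold the block is an eligible target.
        known = self.holders.get(blk, set())
        return [n for n in candidates if n not in known]


def simulate(rounds, later_report=False):
    """Return how many replication attempts fail at a node that has the block."""
    stored = {"dn1": {BLOCK}, "dn3": {BLOCK}, "dn4": set()}
    nn = ToyNameNode()
    nn.resolved.add("dn3")
    nn.block_report("dn3", stored["dn3"])

    # dn1's report arrives before its location is resolved, so it is dropped.
    nn.block_report("dn1", stored["dn1"])
    nn.resolved.add("dn1")

    failed = 0
    for _ in range(rounds):
        if later_report:
            # Assumption: a later report from dn1 is accepted normally.
            nn.block_report("dn1", stored["dn1"])
        for target in nn.replication_targets(BLOCK, ["dn1", "dn4"]):
            if BLOCK in stored[target]:
                failed += 1  # Step 5: the target already has the replica.
            else:
                stored[target].add(BLOCK)
                nn.holders[BLOCK].add(target)
    return failed
```

In this sketch, simulate(rounds) fails once per round for as long as no report from dn1
is accepted, while simulate(rounds, later_report=True) records no failures, which matches
the observation in step 6 that the missing block report is what kept the loop going.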

> Cluster fall into infinite loop trying to replicate a block to a target that aready has
> this replica.
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3050
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3050
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.17.0
>            Reporter: Konstantin Shvachko
>            Assignee: Hairong Kuang
>            Priority: Blocker
>         Attachments: FailedTestDecommission.log
>
>
> This happened during a test run by Hudson. So fortunately we have all logs present.
> http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1987/console
> Search for TestDecommission. And look for block blk_167544198419718831 that is being
> replicated to node 127.0.0.1:65168 over and over again.
> The issue needs to be investigated. I am making it a blocker until it is.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

