Mailing-List: contact core-dev-help@hadoop.apache.org; run by ezmlm
Reply-To: core-dev@hadoop.apache.org
Message-ID: <9125159.1206740424488.JavaMail.jira@brutus>
Date: Fri, 28 Mar 2008 14:40:24 -0700 (PDT)
From: "Hairong Kuang (JIRA)"
To: core-dev@hadoop.apache.org
Subject: [jira] Commented: (HADOOP-3050) Cluster fall into infinite loop trying to replicate a block to a target that aready has this replica.
In-Reply-To: <1823866881.1205956824512.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/HADOOP-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12583204#action_12583204 ]

Hairong Kuang commented on HADOOP-3050:
---------------------------------------

After examining the log, it looks like we hit the following scenario:
1. blk_167544198419718831 was replicated to datanode 1, datanode 2, and datanode 3.
2. Datanode 1 lost contact with the namenode, and datanode 2 was scheduled to be decommissioned.
3. Datanode 1 re-registered with the namenode, but its block report arrived before its network location was resolved, so the block report was dropped.
4. Because the namenode did not know that datanode 1 had blk_167544198419718831, it scheduled replication of the block to datanode 1 and datanode 4.
5. The replication to datanode 1 failed because datanode 1 already had the block.
6. No additional block report was received before the end of the log, so the block replication kept failing.

> Cluster fall into infinite loop trying to replicate a block to a target that aready has this replica.
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3050
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3050
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.17.0
>            Reporter: Konstantin Shvachko
>            Assignee: Hairong Kuang
>            Priority: Blocker
>         Attachments: FailedTestDecommission.log
>
>
> This happened during a test run by Hudson. So fortunately we have all logs present.
> http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1987/console
> Search for TestDecommission. And look for block blk_167544198419718831 that is being replicated to node 127.0.0.1:65168 over and over again.
> The issue needs to be investigated. I am making it a blocker until it is.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
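
The six-step scenario in the comment above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not Hadoop code: `ToyNamenode`, `simulate`, the node names `dn1` to `dn4`, and a replication factor of 3 are all hypothetical, and the model assumes that the copy to datanode 4 succeeds while the copy to datanode 1 is rejected because the replica is already on disk.

```python
from dataclasses import dataclass, field

BLOCK = "blk_167544198419718831"
NODES = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical datanode names


@dataclass
class ToyNamenode:
    """Hypothetical, heavily simplified model; not the real Hadoop namenode."""
    resolved: set = field(default_factory=set)     # nodes with a known network location
    block_map: dict = field(default_factory=dict)  # block -> nodes believed to hold it

    def block_report(self, node, blocks):
        # Step 3: a report that arrives before location resolution is dropped.
        if node not in self.resolved:
            return False
        for blk in blocks:
            self.block_map.setdefault(blk, set()).add(node)
        return True

    def pick_targets(self, blk, decommissioning, replication=3):
        # Step 4: choose targets among nodes not known to hold the block.
        known = self.block_map.get(blk, set())
        live = known - decommissioning
        missing = max(0, replication - len(live))
        candidates = [n for n in NODES
                      if n not in known and n not in decommissioning]
        return candidates[:missing]


def simulate(rounds, report_after_resolution):
    """Return how many replication attempts fail over the given rounds."""
    nn = ToyNamenode(resolved={"dn2", "dn3", "dn4"})
    on_disk = {"dn1": {BLOCK}, "dn2": {BLOCK}, "dn3": {BLOCK}, "dn4": set()}
    decommissioning = {"dn2"}             # step 2
    nn.block_map[BLOCK] = {"dn2", "dn3"}  # dn1 was forgotten when it lost contact

    if report_after_resolution:
        nn.resolved.add("dn1")
        nn.block_report("dn1", on_disk["dn1"])  # accepted
    else:
        nn.block_report("dn1", on_disk["dn1"])  # dropped (step 3)
        nn.resolved.add("dn1")

    failures = 0
    for _ in range(rounds):
        for target in nn.pick_targets(BLOCK, decommissioning):
            if BLOCK in on_disk[target]:
                failures += 1  # step 5: target already has the replica
            else:
                on_disk[target].add(BLOCK)
                nn.block_map[BLOCK].add(target)
    return failures
```

In this model, `simulate(5, report_after_resolution=False)` fails once per round because the namenode keeps choosing `dn1` (step 6), while `simulate(5, report_after_resolution=True)` records no failures, since the accepted block report lets the namenode count `dn1` as a replica holder.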