From: "Arun C Murthy (JIRA)"
To: hadoop-dev@lucene.apache.org
Reply-To: core-dev@hadoop.apache.org
Subject: [jira] Commented: (HADOOP-2691) Some junit tests fail with the exception: All datanodes are bad. Aborting...
Date: Wed, 23 Jan 2008 16:26:34 -0800 (PST)
Message-ID: <2707152.1201134394569.JavaMail.jira@brutus>
In-Reply-To: <13712770.1201126654116.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8

    [ https://issues.apache.org/jira/browse/HADOOP-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12561888#action_12561888 ]

Arun C Murthy commented on HADOOP-2691:
---------------------------------------

I see this problem even on a large cluster running the sort benchmark:

{noformat}
2008-01-23 23:31:50,212 WARN org.apache.hadoop.fs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_1708782005609649024
java.io.IOException: Bad response 1 for block blk_1708782005609649024 from datanode XXX.YYY.43.208:51582
	at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:1750)
2008-01-23 23:31:50,226 INFO org.apache.hadoop.fs.DFSClient: Closing old block blk_1708782005609649024
2008-01-23 23:31:50,226 WARN org.apache.hadoop.fs.DFSClient: Error Recovery for block blk_1708782005609649024 bad datanode[1] XXX.YYY.43.208:51582
2008-01-23 23:31:50,227 WARN org.apache.hadoop.fs.DFSClient: Error Recovery for block blk_1708782005609649024 bad datanode XXX.YYY.43.208:51582
2008-01-23 23:31:50,227 INFO org.apache.hadoop.fs.DFSClient: pipeline = XXX.YYY.44.144:58986
2008-01-23 23:31:50,227 INFO org.apache.hadoop.fs.DFSClient: pipeline = XXX.YYY.44.140:55589
2008-01-23 23:31:50,227 INFO org.apache.hadoop.fs.DFSClient: Connecting to XXX.YYY.44.144:58986
2008-01-23 23:31:50,285 INFO org.apache.hadoop.fs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink XXX.YYY.44.140:55589
2008-01-23 23:31:50,285 WARN org.apache.hadoop.fs.DFSClient: Error Recovery for block blk_1708782005609649024 bad datanode XXX.YYY.44.144:58986
2008-01-23 23:31:50,285 INFO org.apache.hadoop.fs.DFSClient: pipeline = XXX.YYY.44.140:55589
2008-01-23 23:31:50,285 INFO org.apache.hadoop.fs.DFSClient: Connecting to XXX.YYY.44.140:55589
2008-01-23 23:31:50,309 INFO org.apache.hadoop.fs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
2008-01-23 23:31:50,346 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.io.IOException: All datanodes are bad. Aborting...
	at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:1831)
	at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1100(DFSClient.java:1479)
	at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1571)
{noformat}

> Some junit tests fail with the exception: All datanodes are bad. Aborting...
> ----------------------------------------------------------------------------
>
>                 Key: HADOOP-2691
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2691
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.15.2
>            Reporter: Hairong Kuang
>            Assignee: dhruba borthakur
>             Fix For: 0.16.0
>
>         Attachments: datanodesBad.patch
>
>
> Some junit tests fail with the following exception:
> java.io.IOException: All datanodes are bad. Aborting...
> 	at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:1831)
> 	at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1100(DFSClient.java:1479)
> 	at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1571)
> The log contains the following message:
> 2008-01-19 23:00:25,557 INFO dfs.StateChange (FSNamesystem.java:allocateBlock(1274)) - BLOCK* NameSystem.allocateBlock: /srcdat/three/3189919341591612220. blk_6989304691537873255
> 2008-01-19 23:00:25,559 INFO fs.DFSClient (DFSClient.java:createBlockOutputStream(1982)) - pipeline = 127.0.0.1:40678
> 2008-01-19 23:00:25,559 INFO fs.DFSClient (DFSClient.java:createBlockOutputStream(1982)) - pipeline = 127.0.0.1:40680
> 2008-01-19 23:00:25,559 INFO fs.DFSClient (DFSClient.java:createBlockOutputStream(1985)) - Connecting to 127.0.0.1:40678
> 2008-01-19 23:00:25,570 INFO dfs.DataNode (DataNode.java:writeBlock(1084)) - Receiving block blk_6989304691537873255 from /127.0.0.1
> 2008-01-19 23:00:25,572 INFO dfs.DataNode (DataNode.java:writeBlock(1084)) - Receiving block blk_6989304691537873255 from /127.0.0.1
> 2008-01-19 23:00:25,573 INFO dfs.DataNode (DataNode.java:writeBlock(1169)) - Datanode 0 forwarding connect ack to upstream firstbadlink is
> 2008-01-19 23:00:25,573 INFO dfs.DataNode (DataNode.java:writeBlock(1150)) - Datanode 1 got response for connect ack from downstream datanode with firstbadlink as
> 2008-01-19 23:00:25,573 INFO dfs.DataNode (DataNode.java:writeBlock(1169)) - Datanode 1 forwarding connect ack to upstream firstbadlink is
> 2008-01-19 23:00:25,574 INFO dfs.DataNode (DataNode.java:lastDataNodeRun(1802)) - Received block blk_6989304691537873255 of size 34 from /127.0.0.1
> 2008-01-19 23:00:25,575 INFO dfs.DataNode (DataNode.java:lastDataNodeRun(1819)) - PacketResponder 0 for block blk_6989304691537873255 terminating
> 2008-01-19 23:00:25,575 INFO dfs.StateChange (FSNamesystem.java:addStoredBlock(2467)) - BLOCK* NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:40680 is added to blk_6989304691537873255 size 34
> 2008-01-19 23:00:25,575 INFO dfs.DataNode (DataNode.java:close(2013)) - BlockReceiver for block blk_6989304691537873255 waiting for last write to drain.
> 2008-01-19 23:01:31,577 WARN fs.DFSClient (DFSClient.java:run(1764)) - DFSOutputStream ResponseProcessor exception for block blk_6989304691537873255
> java.net.SocketTimeoutException: Read timed out
> 	at java.net.SocketInputStream.socketRead0(Native Method)
> 	at java.net.SocketInputStream.read(SocketInputStream.java:129)
> 	at java.io.DataInputStream.readFully(DataInputStream.java:176)
> 	at java.io.DataInputStream.readLong(DataInputStream.java:380)
> 	at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:1726)
> 2008-01-19 23:01:31,578 INFO fs.DFSClient (DFSClient.java:run(1653)) - Closing old block blk_6989304691537873255
> 2008-01-19 23:01:31,579 WARN fs.DFSClient (DFSClient.java:processDatanodeError(1803)) - Error Recovery for block blk_6989304691537873255 bad datanode[0] 127.0.0.1:40678
> 2008-01-19 23:01:31,580 WARN fs.DFSClient (DFSClient.java:processDatanodeError(1836)) - Error Recovery for block blk_6989304691537873255 bad datanode 127.0.0.1:40678
> 2008-01-19 23:01:31,580 INFO fs.DFSClient (DFSClient.java:createBlockOutputStream(1982)) - pipeline = 127.0.0.1:40680
> 2008-01-19 23:01:31,580 INFO fs.DFSClient (DFSClient.java:createBlockOutputStream(1985)) - Connecting to 127.0.0.1:40680
> 2008-01-19 23:01:31,582 INFO dfs.DataNode (DataNode.java:writeBlock(1084)) - Receiving block blk_6989304691537873255 from /127.0.0.1
> 2008-01-19 23:01:31,584 INFO dfs.DataNode (DataNode.java:writeBlock(1196)) - writeBlock blk_6989304691537873255 received exception java.io.IOException: Reopen Block blk_6989304691537873255 is valid, and cannot be written to.
> 2008-01-19 23:01:31,584 ERROR dfs.DataNode (DataNode.java:run(997)) - 127.0.0.1:40680:DataXceiver: java.io.IOException: Reopen Block blk_6989304691537873255 is valid, and cannot be written to.
> 	at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:613)
> 	at org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:1996)
> 	at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1109)
> 	at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:982)
> 	at java.lang.Thread.run(Thread.java:595)
> 2008-01-19 23:01:31,585 INFO fs.DFSClient (DFSClient.java:createBlockOutputStream(2024)) - Exception in createBlockOutputStream java.io.EOFException
> The log shows that blk_6989304691537873255 was successfully written to both datanodes, but the DFSClient timed out waiting for a response from the first datanode. It tried to recover by resending the data to the second datanode. The recovery failed, however, because the second datanode threw an IOException when it detected that it already had the block. It would be better if the second datanode did not throw an exception for a finalized block during recovery.
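> To make that idea concrete, here is a minimal sketch of the datanode-side check. This is hypothetical illustration code, not the attached datanodesBad.patch: the class name, the in-memory block set, and the isRecovery flag are all assumptions introduced for the example. The only grounded reference is that the exception originates in FSDataset.writeToBlock, per the stack trace above. The point is that a resend for an already-finalized block, when it is part of pipeline recovery, is accepted as a no-op instead of raising "Reopen Block ... is valid, and cannot be written to."
> {noformat}
> // Hypothetical sketch only -- not the actual datanodesBad.patch.
> // Idea: if a write is part of pipeline error recovery and this datanode
> // has already finalized the block, skip the write instead of throwing.
> import java.io.IOException;
> import java.util.HashSet;
> import java.util.Set;
>
> public class FinalizedBlockRecoverySketch {
>
>   // Stand-in for FSDataset's knowledge of finalized (valid) blocks.
>   private final Set<Long> finalizedBlocks = new HashSet<Long>();
>
>   public void finalizeBlock(long blockId) {
>     finalizedBlocks.add(blockId);
>   }
>
>   public boolean isValidBlock(long blockId) {
>     return finalizedBlocks.contains(blockId);
>   }
>
>   /**
>    * Returns true if a new BlockReceiver should be set up for this block,
>    * false if the block is already finalized here and the recovery write
>    * can be acknowledged without re-writing anything.
>    */
>   public boolean writeToBlock(long blockId, boolean isRecovery) throws IOException {
>     if (isValidBlock(blockId)) {
>       if (isRecovery) {
>         // Recovery resend of a block we already finalized: accept as a
>         // no-op so the client gets its ack and can close the block.
>         return false;
>       }
>       // A normal (non-recovery) reopen of a finalized block stays an error.
>       throw new IOException("Reopen Block blk_" + blockId
>           + " is valid, and cannot be written to.");
>     }
>     return true;
>   }
>
>   public static void main(String[] args) throws IOException {
>     FinalizedBlockRecoverySketch dataset = new FinalizedBlockRecoverySketch();
>     dataset.finalizeBlock(6989304691537873255L);
>     // Recovery resend no longer aborts: prints "false" (nothing to re-write).
>     System.out.println(dataset.writeToBlock(6989304691537873255L, true));
>   }
> }
> {noformat}
> With a check along these lines, the recovery attempt in the log above would presumably get a normal connect ack from the second datanode rather than an EOFException, and the client would not end up with "All datanodes are bad. Aborting..."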