From: "Koji Noguchi (JIRA)"
To: hadoop-dev@lucene.apache.org
Reply-To: hadoop-dev@lucene.apache.org
Date: Tue, 2 Oct 2007 14:22:50 -0700 (PDT)
Message-ID: <79753.1191360170762.JavaMail.jira@brutus>
In-Reply-To: <8123240.1190917130811.JavaMail.jira@brutus>
Subject: [jira] Commented: (HADOOP-1955) Corrupted block replication retries for ever

    [ https://issues.apache.org/jira/browse/HADOOP-1955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531915 ]

Koji Noguchi commented on HADOOP-1955:
--------------------------------------

bq.
Koji, as a crude workaround, could you try reading the file? If reading succeeds, you could just manually remove the corrupt source block.

Thanks Raghu. I haven't done this yet, but yes, this should work.

> Corrupted block replication retries for ever
> --------------------------------------------
>
>                 Key: HADOOP-1955
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1955
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.14.1
>            Reporter: Koji Noguchi
>            Assignee: Raghu Angadi
>            Priority: Blocker
>             Fix For: 0.14.2
>
>         Attachments: HADOOP-1955.patch
>
>
> When replicating a corrupted block, the receiving side rejects the block due to a checksum error. The namenode keeps retrying (with the same source datanode).
> Fsck shows those blocks as under-replicated.
>
> [Namenode log]
> {noformat}
> 2007-09-27 02:00:05,273 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 99.2.99.111
> ...
> 2007-09-27 02:01:02,618 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask 99.9.99.11:9999 to replicate blk_-5925066143536023890 to datanode(s) 99.9.99.37:9999
> 2007-09-27 02:10:03,843 WARN org.apache.hadoop.fs.FSNamesystem: PendingReplicationMonitor timed out block blk_-5925066143536023890
> 2007-09-27 02:10:08,248 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask 99.9.99.11:9999 to replicate blk_-5925066143536023890 to datanode(s) 99.9.99.35:9999
> 2007-09-27 02:20:03,848 WARN org.apache.hadoop.fs.FSNamesystem: PendingReplicationMonitor timed out block blk_-5925066143536023890
> 2007-09-27 02:20:08,646 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask 99.9.99.11:9999 to replicate blk_-5925066143536023890 to datanode(s) 99.9.99.19:9999
> (repeats)
> {noformat}
>
> [Datanode (sender) 99.9.99.11 log]
> {noformat}
> 2007-09-27 02:01:04,493 INFO org.apache.hadoop.dfs.DataNode: Starting thread to transfer block blk_-5925066143536023890 to [Lorg.apache.hadoop.dfs.DatanodeInfo;@e58187
> 2007-09-27 02:01:05,153 WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_-5925066143536023890 to 74.6.128.37:50010 got java.net.SocketException: Connection reset
>         at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
>         at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
>         at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>         at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
>         at java.io.DataOutputStream.write(DataOutputStream.java:90)
>         at org.apache.hadoop.dfs.DataNode.sendBlock(DataNode.java:1231)
>         at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1280)
>         at java.lang.Thread.run(Thread.java:619)
> (repeats)
> {noformat}
>
> [Datanode (one of the receivers) 99.9.99.37 log]
> {noformat}
> 2007-09-27 02:01:05,150 ERROR org.apache.hadoop.dfs.DataNode: DataXceiver: java.io.IOException: Unexpected checksum mismatch while writing blk_-5925066143536023890 from /74.6.128.33:57605
>         at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:902)
>         at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:727)
>         at java.lang.Thread.run(Thread.java:619)
> {noformat}

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
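[Editor's note] The receiver's "Unexpected checksum mismatch" is the crux of this bug: when the source replica itself is corrupt, the check fails deterministically, so the namenode's retries (even against fresh target datanodes, as the log shows) can never succeed. A minimal sketch of that kind of verification using java.util.zip.CRC32 — HDFS keeps CRC32 checksums alongside block data, though the class and method names below are hypothetical, not Hadoop's actual DataNode code:

```java
import java.util.zip.CRC32;

// Hypothetical sketch of per-chunk checksum verification on the receiving
// side. Not Hadoop's real code -- it only illustrates why a replica that is
// corrupt at the source fails on every retry, whatever the target node.
public class ChecksumDemo {
    public static long checksum(byte[] chunk) {
        CRC32 crc = new CRC32();
        crc.update(chunk, 0, chunk.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] chunk = "block data as written by the client".getBytes();
        long expected = checksum(chunk);   // stored when the block was created

        byte[] corrupted = chunk.clone();
        corrupted[5] ^= 0x01;              // single-bit rot on the source replica

        // Receiver recomputes and compares; a mismatch means the transfer
        // is rejected, exactly as in the DataXceiver log above.
        System.out.println(checksum(chunk) == expected);      // true
        System.out.println(checksum(corrupted) == expected);  // false
    }
}
```

Because the mismatch is a pure function of the (corrupt) source data, retrying changes nothing — which is why the manual workaround of deleting the corrupt source replica, so replication proceeds from a healthy one, resolves the situation.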