From: "Michael Bieniosek (JIRA)"
To: hadoop-dev@lucene.apache.org
Reply-To: hadoop-dev@lucene.apache.org
Date: Mon, 30 Apr 2007 16:46:15 -0700 (PDT)
Subject: [jira] Commented: (HADOOP-1309) DFS logging in NameSystem.pendingTransfer consumes all disk space
Message-ID: <8810053.1177976775487.JavaMail.jira@brutus>
In-Reply-To: <30277417.1177967475313.JavaMail.jira@brutus>

[ https://issues.apache.org/jira/browse/HADOOP-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492817 ]

Michael Bieniosek commented on HADOOP-1309:
-------------------------------------------

Here's another one from trying to add a new node to my cluster:

2007-04-30 23:10:18,040 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from x.y.z.237:50010
2007-04-30 23:10:18,040 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/x.y.z.237:50010
2007-04-30 23:10:18,040 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from x.y.z.237:50010
2007-04-30 23:10:18,040 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/x.y.z.237:50010
2007-04-30 23:10:18,040 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from x.y.z.237:50010

> DFS logging in NameSystem.pendingTransfer consumes all disk space
> -----------------------------------------------------------------
>
> Key: HADOOP-1309
> URL: https://issues.apache.org/jira/browse/HADOOP-1309
> Project: Hadoop
> Issue Type: Bug
> Components: dfs
> Affects Versions: 0.12.3
> Reporter: Michael Bieniosek
>
> Sometimes the namenode goes crazy.
> I see this in my logs:
> 2007-04-28 02:40:46,992 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask x.y.z.243:50010 to replicate blk_-9064654741761822118 to datanode(s) x.y.z.247:50010
> 2007-04-28 02:40:46,992 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask x.y.z.243:50010 to replicate blk_-8996500637974689840 to datanode(s) x.y.yz.225:50010
> 2007-04-28 02:40:46,992 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask x.y.z.227:50010 to replicate blk_-8870980160272831217 to datanode(s) x.y.z.244:50010
> 2007-04-28 02:40:46,992 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask x.y.z.227:50010 to replicate blk_-8721101562083234290 to datanode(s) x.y.z.250:50010
> 2007-04-28 02:40:46,992 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask x.y.z.250:50010 to replicate blk_-9044741671491162229 to datanode(s) x.y.z.244:50010
> These lines arrive at a rate on the order of 10k per second until the machine runs out of disk space.
> I notice that in FSNamesystem.java, about 10 lines above where this line is logged, there is a comment:
>     //
>     // Move the block-replication into a "pending" state.
>     // The reason we use 'pending' is so we can retry
>     // replications that fail after an appropriate amount of time.
>     // (REMIND - mjc - this timer is not yet implemented.)
>     //
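
The FSNamesystem.java comment quoted in the issue describes a "pending" state with a retry timer that is not yet implemented. As an illustration only, here is a minimal sketch of what such a timer could look like. The class and method names (PendingReplicationSketch, markPending, confirm, takeTimedOut) are hypothetical and are not taken from the Hadoop source, and nothing in this thread establishes that the missing timer is what causes the log flood.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a pending-replication table with a timeout; not Hadoop code. */
class PendingReplicationSketch {
    private final long timeoutMillis;
    // block name -> time (ms) at which the replication request was issued
    private final Map<String, Long> pending = new HashMap<>();

    PendingReplicationSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /**
     * Records a replication request. Returns false if the block is already
     * pending, so the caller can skip both the request and the log line.
     */
    synchronized boolean markPending(String block, long nowMillis) {
        if (pending.containsKey(block)) {
            return false;
        }
        pending.put(block, nowMillis);
        return true;
    }

    /** Called when a datanode reports that the replica has arrived. */
    synchronized void confirm(String block) {
        pending.remove(block);
    }

    /** Removes and returns the blocks whose requests have timed out, so they can be retried. */
    synchronized List<String> takeTimedOut(long nowMillis) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = pending.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() >= timeoutMillis) {
                expired.add(entry.getKey());
                it.remove();
            }
        }
        return expired;
    }
}
```

In this sketch the caller passes the current time explicitly, which keeps the timeout logic easy to test. A block that is already pending is not requested again until it either is confirmed or times out.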