Return-Path:
Delivered-To: apmail-lucene-hadoop-dev-archive@locus.apache.org
Received: (qmail 69026 invoked from network); 26 Jun 2007 18:07:51 -0000
Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2)
    by minotaur.apache.org with SMTP; 26 Jun 2007 18:07:51 -0000
Received: (qmail 99265 invoked by uid 500); 26 Jun 2007 18:07:54 -0000
Delivered-To: apmail-lucene-hadoop-dev-archive@lucene.apache.org
Received: (qmail 99012 invoked by uid 500); 26 Jun 2007 18:07:52 -0000
Mailing-List: contact hadoop-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: hadoop-dev@lucene.apache.org
Delivered-To: mailing list hadoop-dev@lucene.apache.org
Received: (qmail 98999 invoked by uid 99); 26 Jun 2007 18:07:52 -0000
Received: from herse.apache.org (HELO herse.apache.org) (140.211.11.133)
    by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 26 Jun 2007 11:07:52 -0700
X-ASF-Spam-Status: No, hits=-100.0 required=10.0
    tests=ALL_TRUSTED
X-Spam-Check-By: apache.org
Received: from [140.211.11.4] (HELO brutus.apache.org) (140.211.11.4)
    by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 26 Jun 2007 11:07:47 -0700
Received: from brutus (localhost [127.0.0.1])
    by brutus.apache.org (Postfix) with ESMTP id 4E7117141E0
    for ; Tue, 26 Jun 2007 11:07:27 -0700 (PDT)
Message-ID: <11915849.1182881247314.JavaMail.jira@brutus>
Date: Tue, 26 Jun 2007 11:07:27 -0700 (PDT)
From: "Doug Cutting (JIRA)"
To: hadoop-dev@lucene.apache.org
Subject: [jira] Commented: (HADOOP-1486) ReplicationMonitor thread goes away
In-Reply-To: <721225.1181666665932.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-Virus-Checked: Checked by ClamAV on apache.org


    [ https://issues.apache.org/jira/browse/HADOOP-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508264 ]

Doug Cutting commented on HADOOP-1486:
--------------------------------------

> whether to have a monitoring daemon that restarts namenode automatically

It seems safe to restart the namenode in this case.  I'd simply add a loop to NameNode.main() that creates and starts a new NameNode when the existing namenode exits unexpectedly.  We should only restart if it's stopping due to an error, and not due to an explicit call to stop().  So perhaps NameNode#join() could return a boolean indicating whether it's exiting normally or should be restarted, and the catch in the ReplicationMonitor should call a NameNode method to trigger that kind of exit.  Does this sound workable?

> ReplicationMonitor thread goes away
> ------------------------------------
>
>                 Key: HADOOP-1486
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1486
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.12.3
>            Reporter: Koji Noguchi
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.14.0
>
>         Attachments: catchThrowable2.patch
>
>
> Saw many over/under replicated blocks in fsck output.
> .out file showed
> Exception in thread "org.apache.hadoop.dfs.FSNamesystem$ReplicationMonitor@2785982c" java.lang.IllegalArgumentException: Unexpected non-existing data node: /99.9.99.0/99.9.99.42:99999
>         at org.apache.hadoop.net.NetworkTopology.checkArgument(NetworkTopology.java:379)
>         at org.apache.hadoop.net.NetworkTopology.isOnSameRack(NetworkTopology.java:424)
>         at org.apache.hadoop.dfs.FSNamesystem$ReplicationTargetChooser.chooseTarget(FSNamesystem.java:2853)
>         at org.apache.hadoop.dfs.FSNamesystem$ReplicationTargetChooser.chooseTarget(FSNamesystem.java:2816)
>         at org.apache.hadoop.dfs.FSNamesystem.pendingTransfers(FSNamesystem.java:2658)
>         at org.apache.hadoop.dfs.FSNamesystem.computeDatanodeWork(FSNamesystem.java:1774)
>         at org.apache.hadoop.dfs.FSNamesystem$ReplicationMonitor.run(FSNamesystem.java:1723)
>         at java.lang.Thread.run(Thread.java:619)
> (same as HADOOP-1232)
> And, jstack showed no ReplicationMonitor thread.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
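
For readers who want to picture the restart loop proposed in the comment above, here is a rough sketch. It is not Hadoop code: Node is a hypothetical stand-in for NameNode, and requestRestart(), runUntilNormalExit() and the Consumer-based monitor body are names invented for illustration. It only shows the shape of the idea: join() returns a boolean, the monitor thread's catch block triggers an error exit, and the main loop starts a fresh node unless stop() was called explicitly.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;
import java.util.function.Supplier;

public class RestartLoopSketch {

    /** Hypothetical stand-in for NameNode; not the real Hadoop class. */
    static class Node {
        private final CountDownLatch done = new CountDownLatch(1);
        private volatile boolean restartRequested = false;
        private final Consumer<Node> monitorWork;

        Node(Consumer<Node> monitorWork) {
            this.monitorWork = monitorWork;
        }

        /** Starts a background thread playing the role of ReplicationMonitor. */
        void start() {
            Thread monitor = new Thread(() -> {
                try {
                    monitorWork.accept(this);
                } catch (Throwable t) {
                    // Rather than letting the thread die silently,
                    // ask the owner to tear this node down and restart.
                    requestRestart();
                }
            }, "ReplicationMonitor-sketch");
            monitor.setDaemon(true);
            monitor.start();
        }

        /** Explicit shutdown: no restart wanted. */
        void stop() {
            done.countDown();
        }

        /** Error-triggered exit: the caller should start a new node. */
        void requestRestart() {
            restartRequested = true;
            done.countDown();
        }

        /** Blocks until the node exits; true means "restart me". */
        boolean join() throws InterruptedException {
            done.await();
            return restartRequested;
        }
    }

    /** Starts nodes until one exits normally; returns how many were started. */
    static int runUntilNormalExit(Supplier<Node> factory) throws InterruptedException {
        int started = 0;
        boolean restart;
        do {
            Node node = factory.get();
            node.start();
            started++;
            restart = node.join();
        } while (restart);
        return started;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger attempts = new AtomicInteger();
        // The simulated monitor fails on the first two nodes, then stops cleanly.
        int started = runUntilNormalExit(() -> new Node(node -> {
            if (attempts.incrementAndGet() < 3) {
                throw new IllegalArgumentException("Unexpected non-existing data node");
            }
            node.stop();
        }));
        System.out.println("nodes started: " + started); // prints "nodes started: 3"
    }
}
```

A real implementation would probably also want a cap or a delay between restarts, so that a failure that recurs immediately does not turn into a tight restart loop; the comment does not discuss that, so it is left out of the sketch.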