From: "dhruba borthakur (JIRA)"
To: hadoop-dev@lucene.apache.org
Date: Thu, 15 Feb 2007 14:48:05 -0800 (PST)
Subject: [jira] Commented: (HADOOP-442) slaves file should include an 'exclude' section, to prevent "bad" datanodes and tasktrackers from disrupting a cluster
Message-ID: <30234682.1171579685867.JavaMail.jira@brutus>
In-Reply-To: <650567.1155195793928.JavaMail.jira@brutus>

[ 
https://issues.apache.org/jira/browse/HADOOP-442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473544 ]

dhruba borthakur commented on HADOOP-442:
-----------------------------------------

Regarding comment 5 above, it might actually make sense to have a separate thread check whether a decommission has completed. It can run on its own schedule. The ReplicationMonitor thread runs every 3 seconds, which is too frequent for checking decommissioned nodes.

> slaves file should include an 'exclude' section, to prevent "bad" datanodes and tasktrackers from disrupting a cluster
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-442
>                 URL: https://issues.apache.org/jira/browse/HADOOP-442
>             Project: Hadoop
>          Issue Type: Bug
>          Components: conf
>            Reporter: Yoram Arnon
>         Assigned To: Wendy Chien
>         Attachments: hadoop-442-10.patch, hadoop-442-8.patch
>
>
> I recently had a few nodes go bad, such that they were inaccessible to ssh, but were still running their java processes.
> Tasks that executed on them were failing, causing jobs to fail.
> I couldn't stop the java processes, because of the ssh issue, so I was helpless until I could actually power down these nodes.
> Restarting the cluster doesn't help, even when removing the bad nodes from the slaves file - they just reconnect and are accepted.
> While we plan to keep tasks from launching on the same nodes over and over, what I'd like is to be able to prevent rogue processes from connecting to the masters.
> Ideally, the slaves file will contain an 'exclude' section, which will list nodes that shouldn't be accessed, and should be ignored if they try to connect.
> That would also help in configuring the slaves file for a large cluster - I'd list the full range of machines in the cluster, then list the ones that are down in the 'exclude' section.
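
The separate-thread idea from the comment on this issue can be sketched roughly as follows. This is only an illustration, not the namenode code or anything from the attached patches: the class DecommissionMonitor, its methods, and the per-node pending-block counter are all hypothetical names made up for the example. It assumes a decommission counts as complete once a node has no blocks left to re-replicate, and it runs the check on its own daemon thread at whatever interval the caller chooses, independent of any 3-second replication loop.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of a decommission checker that runs on its own schedule. */
class DecommissionMonitor {
    // Assumption: node name -> number of blocks still waiting to be re-replicated.
    private final Map<String, Integer> pendingBlocks = new ConcurrentHashMap<>();
    private final Set<String> decommissioned = ConcurrentHashMap.newKeySet();
    private ScheduledExecutorService scheduler;

    /** Begin decommissioning a node that still has blocks to copy elsewhere. */
    void startDecommission(String node, int blocksToReplicate) {
        pendingBlocks.put(node, blocksToReplicate);
    }

    /** Record that some of a node's blocks now have replicas on other nodes. */
    void blocksReplicated(String node, int count) {
        pendingBlocks.computeIfPresent(node, (n, left) -> Math.max(0, left - count));
    }

    /** One pass of the check: nodes with nothing pending become decommissioned. */
    Set<String> checkOnce() {
        Set<String> finished = new TreeSet<>();
        for (Map.Entry<String, Integer> e : pendingBlocks.entrySet()) {
            if (e.getValue() == 0) {
                finished.add(e.getKey());
            }
        }
        for (String node : finished) {
            pendingBlocks.remove(node);
            decommissioned.add(node);
        }
        return finished;
    }

    boolean isDecommissioned(String node) {
        return decommissioned.contains(node);
    }

    /** Run checkOnce() on a dedicated daemon thread at the given interval. */
    void start(long intervalSeconds) {
        scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "DecommissionMonitor");
            t.setDaemon(true);
            return t;
        });
        scheduler.scheduleWithFixedDelay(
                this::checkOnce, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
    }

    void stop() {
        if (scheduler != null) {
            scheduler.shutdownNow();
        }
    }
}
```

A caller would create one monitor and call start(30) or similar; the interval is a guess to be tuned, since the comment only says that every 3 seconds is too frequent. Keeping the check in checkOnce() also lets it be exercised directly without waiting on the timer.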