Message-ID: <26208734.1155332055847.JavaMail.jira@brutus>
Date: Fri, 11 Aug 2006 14:34:15 -0700 (PDT)
From: "Marco Nicosia (JIRA)"
To: hadoop-dev@lucene.apache.org
Reply-To: hadoop-dev@lucene.apache.org
Subject: [jira] Commented: (HADOOP-442) slaves file should include an 'exclude' section, to prevent "bad" datanodes and tasktrackers from disrupting a cluster
In-Reply-To: <650567.1155195793928.JavaMail.jira@brutus>

    [ http://issues.apache.org/jira/browse/HADOOP-442?page=comments#action_12427632 ]

Marco Nicosia commented on
HADOOP-442:
--------------------------------------

Would it be better to choose either one or the other to be authoritative for all operations?

1] The namenode/jobtrackers maintain the slaves file. Membership and other administrative functions are made via API calls to the process, which modifies a file on disk. That file is used, but never modified, by slaves.sh, etc. If the file is still text, it can be modified between process restarts.

2] The namenode/jobtrackers observe and respect the contents of a file on disk. Standard tools can modify it, but the processes would have to poll the file to see if it has been changed.

I personally prefer #1, tho I'd hope that any API is open (XML-RPC, REST, SOAP...) instead of RMI so that any set of sysadmin automation can talk to it.

> slaves file should include an 'exclude' section, to prevent "bad" datanodes and tasktrackers from disrupting a cluster
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-442
>                 URL: http://issues.apache.org/jira/browse/HADOOP-442
>             Project: Hadoop
>          Issue Type: Bug
>            Reporter: Yoram Arnon
>
> I recently had a few nodes go bad, such that they were inaccessible to ssh, but were still running their java processes.
> tasks that executed on them were failing, causing jobs to fail.
> I couldn't stop the java processes, because of the ssh issue, so I was helpless until I could actually power down these nodes.
> restarting the cluster doesn't help, even when removing the bad nodes from the slaves file - they just reconnect and are accepted.
> while we plan to avoid tasks from launching on the same nodes over and over, what I'd like is to be able to prevent rogue processes from connecting to the masters.
> Ideally, the slaves file will contain an 'exclude' section, which will list nodes that shouldn't be accessed, and should be ignored if they try to connect.
> That would also help in configuring the slaves file for a large cluster - I'd list the full range of machines in the cluster, then list the ones that are down in the 'exclude' section

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
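
For illustration only, option 2] from the comment above (the masters respect a file on disk and poll it for changes), combined with the 'exclude' list requested in the issue, might look roughly like the sketch below. It is written in Python rather than Hadoop's Java to keep it short, and everything in it is an assumption: the ExcludeList class, the one-hostname-per-line file format with '#' comments, and the use of the file's modification time to detect changes are hypothetical, not anything the issue or Hadoop specifies.

```python
import os


class ExcludeList:
    """Hypothetical master-side view of an on-disk exclude file.

    Assumed format: one hostname per line; '#' starts a comment.
    """

    def __init__(self, path):
        self.path = path
        self._mtime = None          # modification time seen at the last load
        self._hosts = frozenset()   # currently excluded hostnames (lower-cased)

    def reload_if_changed(self):
        """Re-read the file only if its modification time changed.

        Returns True when the in-memory list was refreshed.
        """
        hosts = set()
        try:
            mtime = os.stat(self.path).st_mtime_ns
            if mtime == self._mtime:
                return False
            with open(self.path, encoding="utf-8") as f:
                for line in f:
                    name = line.split("#", 1)[0].strip()
                    if name:
                        hosts.add(name.lower())
        except FileNotFoundError:
            # A missing file is treated as an empty exclude list.
            if self._mtime is None:
                return False
            mtime = None
        self._hosts = frozenset(hosts)
        self._mtime = mtime
        return True

    def is_allowed(self, hostname):
        """A connecting node is accepted unless it is on the exclude list."""
        return hostname.lower() not in self._hosts
```

In this sketch the master would call reload_if_changed() on a timer, or just before handling a registration, and refuse any node for which is_allowed() returns False. Under option 1] the same check would apply, but the list would be changed through an API call and the file would only be written by the master.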