Message-ID: <10844157.348321294941945510.JavaMail.jira@thor>
Date: Thu, 13 Jan 2011 13:05:45 -0500 (EST)
From: "Bryan Duxbury (JIRA)"
To: common-dev@hadoop.apache.org
Subject: [jira] Created: (HADOOP-7103) When rack awareness script returns nothing, cluster stops working

When rack awareness script returns nothing, cluster stops working
-----------------------------------------------------------------

                 Key: HADOOP-7103
                 URL: https://issues.apache.org/jira/browse/HADOOP-7103
             Project: Hadoop Common
          Issue Type: Bug
            Reporter: Bryan Duxbury


This was an interesting one. Our rack awareness script contains a one-to-one mapping from host/IP to rack. We added a new rack's worth of machines without updating the awareness script, and when the script was called for the new machines, it returned no output at all.

Surprisingly, basically the entire cluster stopped working. Even tasks or blocks assigned to nodes with a valid rack seemed to fail. The errors were only visible in the namenode and jobtracker logs, so it took a while to figure out the problem. After we fixed the rack awareness script, everything returned to normal operation.

It seems to me that either the error should be raised more aggressively, or a "default" rack should be assumed. Either change would keep a simple mistake from making the entire cluster unusable.
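
For illustration only, here is a minimal sketch of the "default rack" idea applied on the script side rather than inside Hadoop. It assumes the script is invoked with one or more hosts/IPs as arguments and is expected to print one rack per argument on stdout. The mapping table, the `resolve` helper, and the `/default-rack` fallback name are hypothetical choices for this example, not the reporter's actual script.

```python
#!/usr/bin/env python3
"""Sketch of a rack awareness script that never returns an empty answer."""
import sys

# Hypothetical host/IP-to-rack table; a real script would load its own mapping.
RACK_BY_HOST = {
    "10.0.1.11": "/rack1",
    "10.0.1.12": "/rack1",
    "10.0.2.21": "/rack2",
}

# Assumed fallback rack name for hosts that are missing from the table.
DEFAULT_RACK = "/default-rack"


def resolve(hosts):
    """Return exactly one rack per host, falling back to DEFAULT_RACK."""
    racks = []
    for host in hosts:
        rack = RACK_BY_HOST.get(host)
        if rack is None:
            # Report the gap on stderr so it is noticed, but still answer.
            print(f"WARNING: no rack mapping for {host}; using {DEFAULT_RACK}",
                  file=sys.stderr)
            rack = DEFAULT_RACK
        racks.append(rack)
    return racks


if __name__ == "__main__":
    # One rack per argument, space-separated, on a single stdout line.
    print(" ".join(resolve(sys.argv[1:])))
```

With a fallback like this, an unmapped machine is placed in a catch-all rack and the warning points at the stale mapping, instead of the script silently returning nothing.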