From: "Raghu Angadi (JIRA)"
To: core-dev@hadoop.apache.org
Date: Tue, 17 Feb 2009 16:47:00 -0800 (PST)
Subject: [jira] Commented: (HADOOP-3810) NameNode seems unstable on a cluster with little space left
Message-ID: <1395042647.1234918020573.JavaMail.jira@brutus>
In-Reply-To: <1381255878.1216759651593.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/HADOOP-3810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674413#action_12674413 ]

Raghu Angadi commented on HADOOP-3810:
--------------------------------------

Another minor fix: every node considered invokes 'getTotalLoad', which obtains the heartbeat lock. We should remove this lock (either with a volatile, or by just accepting a slightly stale value).

> NameNode seems unstable on a cluster with little space left
> -----------------------------------------------------------
>
>                 Key: HADOOP-3810
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3810
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.17.1
>            Reporter: Raghu Angadi
>            Assignee: Raghu Angadi
>         Attachments: simon-namenode.PNG
>
>
> NameNode seems not very responsive and unstable when the cluster has very little space left. The clients time out. The main problem is that it is not clear to the user what is going on. Once I have more details about a NameNode that was in this state, I will fill in here.
> If there is not enough space left on a cluster, it is ok for clients to receive something like a "DiskOutOfSpace" exception.
> Right now it looks like NameNode tries too hard to find a node with any space left and ends up being slow to respond to clients. If the CPU taken by chooseTarget() is the main cause, there are two possible fixes:
> # chooseTarget() iterates and takes quite a bit of CPU for allocating datanodes. Usually this is not much of a problem. It takes even more CPU when it needs to search multiple racks for a datanode. We could probably reduce some CPU for these searches. The benefit should be measurable.
> # Once NameNode cannot find any datanode that has space on a rack, it could mark the rack as "full" and skip searching the rack for the next minute or so. This flag gets cleared after a minute or if any new node is added to the rack.
> #* Of course, this might not be optimal w.r.t. disk space usage, but only for a short duration. Once a cluster is mostly full, the user does expect errors.
> #* On the flip side, this fix does not require an extremely CPU-optimized version of chooseTarget().
> #* I think it is reasonable for NameNode to throw DiskOutOfSpace exception, even though it could have found space if it searched much more extensively.
> ---
> edit : minor changes

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
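For illustration only, the rack-level "full" marker proposed in the quoted description (second fix) could look roughly like the sketch below. The class FullRackCache and its method names are hypothetical and do not exist in Hadoop; the one-minute expiry is passed in as a parameter, and the caller supplies the current time so the behaviour is easy to check.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not part of Hadoop: remembers which racks were
// recently found to have no datanode with free space.
final class FullRackCache {
    private final long ttlMillis;
    // rack name -> time (ms) at which the rack was last marked "full"
    private final Map<String, Long> fullSince = new ConcurrentHashMap<>();

    FullRackCache(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    // Called when a search of the rack found no datanode with space.
    void markFull(String rack, long nowMillis) {
        fullSince.put(rack, nowMillis);
    }

    // Called when a new node joins the rack: clear the flag immediately.
    void nodeAdded(String rack) {
        fullSince.remove(rack);
    }

    // True if the rack was marked full less than ttlMillis ago.
    boolean shouldSkip(String rack, long nowMillis) {
        Long since = fullSince.get(rack);
        if (since == null) {
            return false;
        }
        if (nowMillis - since >= ttlMillis) {
            // Flag expired: drop it, unless it was refreshed concurrently.
            fullSince.remove(rack, since);
            return false;
        }
        return true;
    }
}
```

A target-selection loop would consult shouldSkip(rack, now) before iterating over the rack's datanodes and call markFull(rack, now) when the iteration finds nothing. As the description notes, this trades a short period of suboptimal space usage for less CPU spent searching.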
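The "volatile" option mentioned in the comment at the top of this message could be sketched as follows. This is a minimal stand-in, not the actual NameNode code: LoadTracker, updateLoad, and heartbeatLock are illustrative names, and only getTotalLoad is taken from the comment. Writers still update the counter under the lock, while readers take a lock-free and possibly slightly stale value.

```java
// Hypothetical stand-in for the NameNode's load bookkeeping.
final class LoadTracker {
    private final Object heartbeatLock = new Object();

    // Written only while holding heartbeatLock; volatile so that readers
    // see the latest published value without taking the lock.
    private volatile int totalLoad;

    // Heartbeat path: replace one node's old load with its new load.
    void updateLoad(int oldNodeLoad, int newNodeLoad) {
        synchronized (heartbeatLock) {
            totalLoad = totalLoad - oldNodeLoad + newNodeLoad;
        }
    }

    // Target-selection path: called once per candidate node, no lock taken.
    int getTotalLoad() {
        return totalLoad;
    }
}
```

Because the read-modify-write stays inside the synchronized block, updates are not lost. The only change for callers of getTotalLoad is that they no longer contend with heartbeat processing.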