Message-ID: <9227945.8431274907832743.JavaMail.jira@thor>
Date: Wed, 26 May 2010 17:03:52 -0400 (EDT)
From: "Dmytro Molkov (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] Updated: (HDFS-599) Improve Namenode robustness by prioritizing datanode heartbeats over client requests
In-Reply-To: <1801387541.1252136877482.JavaMail.jira@brutus>

[ https://issues.apache.org/jira/browse/HDFS-599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmytro Molkov updated HDFS-599:
-------------------------------
    Attachment: HDFS-599.3.patch

Please have a look. I addressed Hairong's comments on the previous patch. I will create additional Jiras for the rest of the comments in the conversation.

@Hairong, as far as TestDistributedFileSystem is concerned, it was more of a problem with the svn diff command. The actual change is really small. I added one more test case, which reruns the other test cases with the service port on. Only a little was needed to make it work: instead of constructing a new HdfsConfiguration, each test case calls a method that, based on the dualPortTesting boolean flag, creates a conf with the service port configuration turned on.

> Improve Namenode robustness by prioritizing datanode heartbeats over client requests
> ------------------------------------------------------------------------------------
>
>                 Key: HDFS-599
>                 URL: https://issues.apache.org/jira/browse/HDFS-599
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: name-node
>            Reporter: dhruba borthakur
>            Assignee: Dmytro Molkov
>         Attachments: HDFS-599.3.patch, HDFS-599.patch
>
>
> The namenode processes RPC requests from clients that are reading/writing to files as well as heartbeats/block reports from datanodes.
> Sometimes, for various reasons (Java GC runs, inconsistent performance of the NFS filer that stores the HDFS transaction logs, etc.), the namenode encounters transient slowness. For example, if the device that stores the HDFS transaction logs becomes sluggish, the Namenode's ability to process RPCs slows down to a certain extent. During this time, the RPCs from clients as well as the RPCs from datanodes suffer in similar fashion. If the underlying problem becomes worse, the NN's ability to process a heartbeat from a DN is severely impacted, thus causing the NN to declare that the DN is dead. Then the NN starts replicating blocks that used to reside on the now-declared-dead datanode. This adds extra load to the NN. Then the now-declared-dead datanode finally re-establishes contact with the NN and sends a block report. The block report processing on the NN is another heavyweight activity, thus causing more load on the already overloaded namenode.
> My proposal is that the NN should try its best to continue processing RPCs from datanodes and give lower priority to serving client requests. The Datanode RPCs are integral to the consistency and performance of the Hadoop file system, and it is better to protect them at all costs. This will ensure that the NN recovers from the hiccup much faster than it does now.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
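
For illustration only, the idea in the quoted proposal (serve datanode RPCs ahead of client RPCs) can be sketched as a toy priority queue. This is not the code in HDFS-599.3.patch, which, going by the comment above, works with a service port configuration. The names PrioritizedCallQueue, DATANODE, CLIENT and drain are hypothetical and exist only for this sketch.

```python
import heapq
import itertools

# Hypothetical priority levels for this sketch; a lower number is served first.
DATANODE = 0  # heartbeats and block reports from datanodes
CLIENT = 1    # read/write requests from clients


class PrioritizedCallQueue:
    """Toy RPC queue that always serves datanode calls before client calls."""

    def __init__(self):
        self._heap = []
        # A monotonic counter keeps FIFO order within one priority level.
        self._seq = itertools.count()

    def put(self, priority, call):
        """Queue a call under the given priority level."""
        heapq.heappush(self._heap, (priority, next(self._seq), call))

    def take(self):
        """Return the next call to process, or None if the queue is empty."""
        if not self._heap:
            return None
        _, _, call = heapq.heappop(self._heap)
        return call

    def __len__(self):
        return len(self._heap)


def drain(queue):
    """Take every queued call and return them in the order they were served."""
    served = []
    while len(queue):
        served.append(queue.take())
    return served


if __name__ == "__main__":
    q = PrioritizedCallQueue()
    q.put(CLIENT, "client: open /a")
    q.put(CLIENT, "client: create /b")
    q.put(DATANODE, "dn1: heartbeat")
    q.put(CLIENT, "client: open /c")
    q.put(DATANODE, "dn2: heartbeat")
    for call in drain(q):
        print(call)
```

In this sketch both heartbeats are served before any client call, even though they were queued later. Strict priority of this kind can starve client requests while datanode traffic stays heavy, which is one reason a real design might prefer separate ports or handler pools.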