Delivered-To: apmail-lucene-hadoop-dev-archive@locus.apache.org
Mailing-List: contact hadoop-dev-help@lucene.apache.org; run by ezmlm
Reply-To: hadoop-dev@lucene.apache.org
Delivered-To: mailing list hadoop-dev@lucene.apache.org
Message-ID: <32627049.1174436432582.JavaMail.jira@brutus>
Date: Tue, 20 Mar 2007 17:20:32 -0700 (PDT)
From: "Hairong Kuang (JIRA)"
To: hadoop-dev@lucene.apache.org
Subject: [jira] Updated: (HADOOP-1073) DFS Scalability: high CPU usage in choosing replication targets and file open
In-Reply-To: <14179372.1173223644113.JavaMail.root@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

     [ https://issues.apache.org/jira/browse/HADOOP-1073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-1073:
----------------------------------

    Attachment: getDistance.patch

This patch optimizes the performance of getDistance() by changing the clusterMap data structures. I will submit a separate patch to incorporate the other suggestions.

> DFS Scalability: high CPU usage in choosing replication targets and file open
> -----------------------------------------------------------------------------
>
>         Key: HADOOP-1073
>         URL: https://issues.apache.org/jira/browse/HADOOP-1073
>     Project: Hadoop
>  Issue Type: Bug
>  Components: dfs
>    Reporter: dhruba borthakur
> Assigned To: Hairong Kuang
> Attachments: getDistance.patch
>
>
> I have a test cluster that has about 1600 data nodes. randomWriter fails to run because map tasks fail with a "connection timeout" message. The namenode quickly gets to 100% CPU usage.
>
> The positives first:
> 1. Datanodes continue to heartbeat and there are no cascading failures.
> 2. chooseRandom() does not use much CPU and is very lightweight.
>
> An analysis of the namenode shows the following:
> 1. High CPU usage in FSNamesystem.getPipeline().
> 2. Moderate CPU usage in FSNamesystem.sortByDistance().
>
> The first one is used by chooseTarget() to sort a list of target datanodes based on their distances from the writer. The second one is used by an open() call to arrange the list of datanodes so that the datanode closest to the reader is first in the list.
>
> I have two proposals to address this problem. Please comment.
>
> Proposal 1: Optimize getDistance()
> --------------
> In the current implementation, each datanode has a network path associated with it.
For example "/default-rack/74.6.138.207:50010". The method getDistance() splits the network-pathname (using "/") and then does string-compares to determine the nearest common ancestor of two given nodes. One optimization would be to avoid string splits and comparisions while determining distance between two nodes. > Instead, we can maintain the "height" at which a node is located in the network topology tree. The root node being at heigth 0. Also, from each InnerNode we maintain a direct reference to the parent node. If the two nodes are at the same height, send each node to its parent until we reach a common parent. Thus the distance between the two nodes is 2x where x is the distance to the common parent. If the nodes are at different depths to begin with, then repeatedly send the node at a greater height to its parent until the nodes are at the same height, and then continue as before. > Also, the calls to check checkArgument() from getDistance() may be removed. > Also, the call to getPipeline() may be done outside the global FSNamesystem lock. > Proposal 2: Distribute the workload to the DFSClient > --------------- > The namenode downloads the network topology to a dfsclient. The dfsclient caches it in memory. When a new block needs to be allocated, the namenode sends a list of unsorted datanodes to the client. The client sorts them based on the cached network topology map. Similarly, when a file is opened, the namenode sends the list of unsorted blocks that comprise this file. The dfsclient sorts them and uses them appropriately. The topology map can be compacted into maybe a 1Mb buffer for a 10000 node system. > If the network topology is very big, then another option would be to have a set of toppology servers (that has a cached copy of the network topology) and the dfsclient contacts one of them to sort its list of target datanodes. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.