Date: Tue, 21 Dec 2010 23:30:03 -0500 (EST)
From: "Hadoop QA (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] Commented: (HDFS-1105) Balancer improvement

    [ https://issues.apache.org/jira/browse/HDFS-1105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974046#action_12974046 ]

Hadoop QA commented on HDFS-1105:
---------------------------------

-1 overall.
Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12456072/HDFS-1105.4.patch
  against trunk revision 1051669.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed these core unit tests:
                    org.apache.hadoop.hdfs.server.namenode.TestStorageRestore
                    org.apache.hadoop.hdfs.TestFileConcurrentReader
                    org.apache.hadoop.hdfs.TestHDFSTrash

    -1 contrib tests.  The patch failed contrib unit tests.

    +1 system test framework.  The patch passed system test framework compile.

Test results: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/27//testReport/
Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/27//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/27//console

This message is automatically generated.
> Balancer improvement
> --------------------
>
>                 Key: HDFS-1105
>                 URL: https://issues.apache.org/jira/browse/HDFS-1105
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: balancer
>            Reporter: Dmytro Molkov
>            Assignee: Dmytro Molkov
>             Fix For: 0.23.0
>
>         Attachments: HDFS-1105.2.patch, HDFS-1105.3.patch, HDFS-1105.4.patch, HDFS-1105.patch
>
>
> We were seeing some weird issues with the balancer in our cluster:
> 1) It can get stuck during an iteration, and only restarting it helps.
> 2) Iterations are highly inefficient. With a 20-minute iteration, it moves 7K blocks a minute for the first 6 minutes and hundreds of blocks in the remaining 14 minutes.
> 3) It can hit the namenode and the network pretty hard.
> As a result, we came up with a few improvements that make the balancer's iteration running time more deterministic, improve its efficiency, and make its load configurable:
> - Make many of the constants configurable command-line parameters: the iteration length, and the number of blocks to move in parallel to a given node and across the cluster overall.
> - Terminate transfers that are still in progress when the iteration is over.
> Previously, the iteration time was the window in which the balancer scheduled moves; it then waited indefinitely for those moves to finish. Each scheduling task could run for up to the iteration time or even longer, so with many long-running tasks the actual iterations took longer than 20 minutes. Now each scheduling task knows when the iteration started and schedules moves only if it has not run out of time, so tasks that start after the iteration is over schedule no moves.
> The number of move threads and dispatch threads is configurable, so the balancer can be run more slowly depending on the load of the cluster.
> I will attach a patch; please let me know what you think and what can be done better.

-- 
This message is automatically generated by JIRA.
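The time-bounded scheduling proposed in the issue description can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the code in HDFS-1105.4.patch and not Hadoop's Balancer API: the class `IterationScheduler`, its parameters (`iterationMillis`, `maxMovesPerNode`, `maxMovesCluster`, `moverThreads`) and the `Runnable` standing in for a block transfer are all hypothetical. For simplicity a single dispatch loop replaces the separate dispatch threads. Before every move it checks the remaining iteration time, it caps concurrent moves per target node and across the cluster with semaphores, and it cancels transfers still running when the iteration ends.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch of one time-bounded balancer iteration (not Hadoop's Balancer API). */
public class IterationScheduler {
    /** Counts reported at the end of an iteration. */
    public static final class Result {
        public final int scheduled;
        public final int cancelled;
        Result(int scheduled, int cancelled) {
            this.scheduled = scheduled;
            this.cancelled = cancelled;
        }
    }

    private final long iterationMillis;   // hypothetical: iteration length
    private final int maxMovesPerNode;     // hypothetical: parallel moves to one node
    private final int maxMovesCluster;     // hypothetical: parallel moves cluster-wide
    private final int moverThreads;        // hypothetical: size of the mover pool

    public IterationScheduler(long iterationMillis, int maxMovesPerNode,
                              int maxMovesCluster, int moverThreads) {
        this.iterationMillis = iterationMillis;
        this.maxMovesPerNode = maxMovesPerNode;
        this.maxMovesCluster = maxMovesCluster;
        this.moverThreads = moverThreads;
    }

    /** Each entry of moveTargets is one pending block move to the named node. */
    public Result runIteration(List<String> moveTargets, Runnable transfer)
            throws InterruptedException {
        final long deadline = System.currentTimeMillis() + iterationMillis;
        final Semaphore clusterSlots = new Semaphore(maxMovesCluster);
        final Map<String, Semaphore> nodeSlots = new HashMap<>();
        final ExecutorService movers = Executors.newFixedThreadPool(moverThreads);
        final List<Future<?>> inFlight = new ArrayList<>();
        try {
            for (String target : moveTargets) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    break; // out of time: schedule nothing more this iteration
                }
                // Wait for a cluster-wide slot, but never past the deadline.
                if (!clusterSlots.tryAcquire(remaining, TimeUnit.MILLISECONDS)) {
                    break;
                }
                Semaphore node = nodeSlots.computeIfAbsent(
                        target, k -> new Semaphore(maxMovesPerNode));
                if (!node.tryAcquire()) {
                    clusterSlots.release(); // target node is saturated; skip this move
                    continue;
                }
                inFlight.add(movers.submit(() -> {
                    try {
                        transfer.run();
                    } finally {
                        node.release();
                        clusterSlots.release();
                    }
                }));
            }
            // Let moves run until the deadline, then cancel whatever is left.
            int cancelled = 0;
            for (Future<?> f : inFlight) {
                long remaining = Math.max(0, deadline - System.currentTimeMillis());
                try {
                    f.get(remaining, TimeUnit.MILLISECONDS);
                } catch (TimeoutException e) {
                    if (f.cancel(true)) { // interrupts a transfer that is still running
                        cancelled++;
                    }
                } catch (ExecutionException | CancellationException e) {
                    // A failed move simply counts as not completed.
                }
            }
            return new Result(inFlight.size(), cancelled);
        } finally {
            movers.shutdownNow();
        }
    }
}
```

Because a move is admitted only while time remains and unfinished futures are cancelled at the deadline, an iteration does not wait indefinitely for stragglers, provided the transfer code actually responds to thread interruption. Creating the semaphores and the mover pool per iteration keeps permits from leaking across iterations when a queued move is cancelled before it starts.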