Date: Tue, 18 Aug 2015 01:45:46 +0000 (UTC)
From: "Tsz Wo Nicholas Sze (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Updated] (HDFS-8278) HDFS Balancer should consider remaining storage % when checking for under-utilized machines

     [ https://issues.apache.org/jira/browse/HDFS-8278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo Nicholas Sze updated HDFS-8278:
--------------------------------------
       Resolution: Fixed
     Hadoop Flags: Reviewed
    Fix Version/s: 2.8.0
           Status: Resolved  (was: Patch Available)

Thanks Jing for reviewing the patch. I have committed this.

> HDFS Balancer should consider remaining storage % when checking for under-utilized machines
> -------------------------------------------------------------------------------------------
>
>                 Key: HDFS-8278
>                 URL: https://issues.apache.org/jira/browse/HDFS-8278
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: balancer & mover
>    Affects Versions: 2.8.0
>            Reporter: Gopal V
>            Assignee: Tsz Wo Nicholas Sze
>             Fix For: 2.8.0
>
>         Attachments: h8278_20150817.patch
>
>
> The DFS balancer mistakenly identifies a node with very little storage space remaining as an "underutilized" node and tries to move large amounts of data to that node.
> All of these block moves fail, because the node's remaining DFS storage matters more than its % utilization.
> {code}
> 15/04/24 04:25:55 INFO balancer.Balancer: 0 over-utilized: []
> 15/04/24 04:25:55 INFO balancer.Balancer: 1 underutilized: [172.19.1.46:50010:DISK]
> 15/04/24 04:25:55 INFO balancer.Balancer: Need to move 47.68 GB to make the cluster balanced.
> 15/04/24 04:25:55 INFO balancer.Balancer: Decided to move 413.08 MB bytes from 172.19.1.52:50010:DISK to 172.19.1.46:50010:DISK
> 15/04/24 04:25:55 INFO balancer.Balancer: Will move 413.08 MB in this iteration
> 15/04/24 04:25:55 WARN balancer.Dispatcher: Failed to move blk_1078689321_1099517353638 with size=131146 from 172.19.1.52:50010:DISK to 172.19.1.46:50010:DISK through 172.19.1.53:50010: Got error, status message opReplaceBlock BP-942051088-172.18.1.41-1370508013893:blk_1078689321_1099517353638 received exception org.apache.hadoop.util.DiskChecker$DiskOutOfSpaceException: Out of space: The volume with the most available space (=225042432 B) is less than the block size (=268435456 B)., block move is failed
> {code}
> The machine in question is under-utilized in terms of block pool (BP) utilization, but has very little free space available for blocks.
> {code}
> Decommission Status : Normal
> Configured Capacity: 3826907185152 (3.48 TB)
> DFS Used: 2817262833664 (2.56 TB)
> Non DFS Used: 1000621305856 (931.90 GB)
> DFS Remaining: 9023045632 (8.40 GB)
> DFS Used%: 73.62%
> DFS Remaining%: 0.24%
> Configured Cache Capacity: 8589934592 (8 GB)
> Cache Used: 0 (0 B)
> Cache Remaining: 8589934592 (8 GB)
> Cache Used%: 0.00%
> Cache Remaining%: 100.00%
> Xceivers: 3
> Last contact: Fri Apr 24 04:28:36 PDT 2015
> {code}
> The machine has only 8.40 GB of non-RAM storage available (DFS Remaining, 0.24%), so it is futile to attempt to move any blocks to that machine.
> A similar concern arises when a machine loses disks, since utilization is always compared as a per-node percentage. That scenario also needs to cap data movement to the node at its "DFS Remaining %" value.
> Trying to move any more data than that to a given node will always fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
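
The capping idea in the description can be illustrated with a small sketch. This is not the committed patch (h8278_20150817.patch) and not Balancer code. The names DatanodeStorage and max_bytes_to_receive are hypothetical. The 85% cluster average is an assumed figure, since the report does not state it, and the 10% threshold is likewise only a parameter of the sketch. The byte counts are taken from the datanode report quoted above.

```python
from dataclasses import dataclass


@dataclass
class DatanodeStorage:
    """Per-node figures as printed in the datanode report (all in bytes)."""
    capacity: int    # Configured Capacity
    dfs_used: int    # DFS Used
    remaining: int   # DFS Remaining

    @property
    def used_percent(self) -> float:
        return 100.0 * self.dfs_used / self.capacity


def max_bytes_to_receive(node: DatanodeStorage,
                         cluster_avg_percent: float,
                         threshold_percent: float = 10.0) -> int:
    """Hypothetical helper: upper bound on bytes worth scheduling onto a node.

    A node counts as under-utilized when its DFS Used% is more than
    threshold_percent below the cluster average. The amount suggested by
    the utilization gap is then capped at the node's DFS Remaining, so a
    nearly full node is never asked to take more than it can hold.
    """
    deficit_percent = cluster_avg_percent - node.used_percent
    if deficit_percent <= threshold_percent:
        return 0  # within the threshold: not an under-utilized target
    by_utilization = int(deficit_percent * node.capacity / 100)
    return min(by_utilization, node.remaining)


# Figures from the report for 172.19.1.46.
node = DatanodeStorage(capacity=3826907185152,
                       dfs_used=2817262833664,
                       remaining=9023045632)

# ASSUMPTION: an 85% cluster average, chosen only so that this node
# falls below the threshold as it did in the balancer log.
limit = max_bytes_to_receive(node, cluster_avg_percent=85.0)
print(f"DFS Used%: {node.used_percent:.2f}, cap: {limit} bytes")
```

Under these assumptions the utilization gap alone would suggest moving several hundred GB to the node, while the cap limits it to the 9023045632 bytes of DFS Remaining. The sketch works at node level only. The failure in the log was reported per volume (the largest free volume was smaller than one block), so a fuller check would probably also need to look at individual volumes.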