Date: Wed, 11 Jul 2018 16:34:00 +0000 (UTC)
From: "Stephen O'Donnell (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Created] (HDFS-13728) Disk Balancer should not fail if volume usage is greater than capacity

Stephen O'Donnell created HDFS-13728:
----------------------------------------

             Summary: Disk Balancer should not fail if volume usage is greater than capacity
                 Key: HDFS-13728
                 URL: https://issues.apache.org/jira/browse/HDFS-13728
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: diskbalancer
    Affects Versions: 3.0.3
            Reporter: Stephen O'Donnell

We have seen a couple of scenarios where the disk balancer fails because a datanode reports more space used on a disk than the disk's capacity, which should not be possible.
This is due to the following check in DiskBalancerVolume.java:

{code}
  public void setUsed(long dfsUsedSpace) {
    Preconditions.checkArgument(dfsUsedSpace < this.getCapacity(),
        "DiskBalancerVolume.setUsed: dfsUsedSpace(%s) < capacity(%s)",
        dfsUsedSpace, getCapacity());
    this.used = dfsUsedSpace;
  }
{code}

While I agree that a DN should never report more usage on a volume than the volume's capacity, there seems to be some issue that causes this to happen occasionally. In practice it is this full disk that prompts someone to run the Disk Balancer in the first place, only for it to fail with the error above. At that point there appears to be nothing you can do to force the Disk Balancer to run; in the scenarios I saw, the issue was only resolved once some data was removed from the disk and the usage dropped back below the capacity.

Can we consider relaxing the above check so that, if the reported usage is greater than the capacity, the usage is simply set to the capacity and the calculations all still work? E.g. something like this:

{code}
  public void setUsed(long dfsUsedSpace) {
-   Preconditions.checkArgument(dfsUsedSpace < this.getCapacity());
-   this.used = dfsUsedSpace;
+   if (dfsUsedSpace > this.getCapacity()) {
+     this.used = this.getCapacity();
+   } else {
+     this.used = dfsUsedSpace;
+   }
  }
{code}
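For illustration only, here is a minimal, self-contained sketch of the proposed behaviour. This is not the actual DiskBalancerVolume class; the class name, logger, and main method are placeholders added so the example compiles and runs on its own. The idea is the same: cap the reported usage at the capacity and log a warning instead of throwing.

{code}
import java.util.logging.Logger;

/**
 * Simplified stand-in for a volume data model, used only to illustrate
 * capping the reported usage at the volume capacity instead of failing.
 */
public class VolumeUsageExample {
  private static final Logger LOG =
      Logger.getLogger(VolumeUsageExample.class.getName());

  private long capacity;
  private long used;

  public long getCapacity() {
    return capacity;
  }

  public void setCapacity(long capacity) {
    this.capacity = capacity;
  }

  public long getUsed() {
    return used;
  }

  /**
   * Relaxed setter: if the datanode reports more used space than the
   * volume capacity, warn and cap the value so downstream calculations
   * still work, rather than failing the whole balancer run.
   */
  public void setUsed(long dfsUsedSpace) {
    if (dfsUsedSpace > this.capacity) {
      LOG.warning("Reported usage " + dfsUsedSpace + " exceeds capacity "
          + this.capacity + "; capping usage at capacity.");
      this.used = this.capacity;
    } else {
      this.used = dfsUsedSpace;
    }
  }

  public static void main(String[] args) {
    VolumeUsageExample vol = new VolumeUsageExample();
    vol.setCapacity(100L);
    vol.setUsed(120L);                  // over-reported usage no longer throws
    System.out.println(vol.getUsed());  // prints 100
  }
}
{code}

Whether the real fix should also emit a warning (and at what level) is open for discussion; the sketch above just shows one way the capped value could be made visible to operators.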