From: "Hari Sekhon (JIRA)"
To: hdfs-issues@hadoop.apache.org
Date: Tue, 17 Jul 2018 10:53:00 +0000 (UTC)
Subject: [jira] [Updated] (HDFS-13739) Option to disable Rack Local Write Preference to avoid 2 issues - 1. Rack-by-Rack Maintenance leaves last data replica at risk, 2. avoid Major Storage Imbalance across DataNodes caused by uneven spread of DataNodes across Racks

     [ https://issues.apache.org/jira/browse/HDFS-13739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hari Sekhon updated HDFS-13739:
-------------------------------
    Description:

The current HDFS write pattern of "local node, rack-local node, other-rack node" is good for most purposes, but there are at least two scenarios where it is not ideal:
 # Rack-by-Rack Maintenance leaves data at risk of losing the last remaining replica. If a single DataNode then failed, it would likely cause some data outage, or even data loss if the rack is lost or the upgrade fails (perhaps it's a rack rebuild). Setting replication to 4 would reduce write performance and waste storage, and is currently the only workaround to this issue (see the sketch after this list).
 # Major Storage Imbalance across DataNodes when there is an uneven layout of DataNodes across racks - some nodes fill up while others are half empty.
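For reference, the replication-factor workaround mentioned in point 1 looks roughly like this (a sketch only - the path shown is illustrative, not taken from this cluster):
{code}
# Raise the replication factor of existing data to 4 and wait (-w) for
# re-replication to complete; the path is an illustrative example.
hdfs dfs -setrep -w 4 /apps/hbase/data

# For newly written files the cluster-wide default can be raised instead,
# via dfs.replication in hdfs-site.xml - at the cost of roughly a third
# more storage per block compared to replication 3.
{code}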
I have observed this storage imbalance on a cluster where half the nodes were 85% full and the other half were only 50% full.

Rack layouts like the following illustrate the problem - the nodes in the same rack will only choose to send half of their block replicas to each other, so they fill up first, while the other nodes receive far fewer replica blocks:
{code}
NumNodes - Rack
2 - rack 1
2 - rack 2
1 - rack 3
1 - rack 4
1 - rack 5
1 - rack 6{code}
In this case, if I reduce the number of replicas to 2 I get an almost perfect spread of blocks across all DataNodes, because HDFS has no choice but to place the only second replica on a different rack. If I increase replication back to 3, it goes back to 85% full on half the nodes and 50% on the other half, because the extra replicas are written only to rack-local nodes.

Why not just run the HDFS Balancer to fix it, you might say? This is a heavily loaded HBase cluster - aside from destroying HBase's data locality and performance by moving blocks out from underneath the RegionServers, as soon as an HBase major compaction occurs (at least weekly) all blocks get rewritten by HBase, and the HDFS client once again writes to the local node, a rack-local node and an other-rack node, producing the same storage imbalance. Hence this cannot be solved by running the HDFS Balancer on HBase clusters - or for any application sitting on top of HDFS that has any HDFS block churn.
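To make the mechanism above concrete, the following is a rough, self-contained Monte Carlo sketch (an illustration, not code from the HDFS code base) of the write pattern as described in this report - local node, then a rack-local node where one exists, then a node on another rack - applied to the rack layout shown above. The node names, the uniform spread of writers, and the fallback for writers on single-node racks are all simplifying assumptions:
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

/**
 * Rough sketch of the write pattern as described in this report
 * ("local node, rack-local node, other-rack node") for the rack layout above.
 * It is NOT the exact BlockPlacementPolicyDefault logic; it only illustrates
 * why nodes that share a rack accumulate more replicas than nodes that sit
 * alone on their rack.
 */
public class RackImbalanceSketch {

    record Node(String name, String rack) {}

    public static void main(String[] args) {
        List<Node> nodes = new ArrayList<>();
        // Layout from the description: 2 nodes in racks 1-2, 1 node in racks 3-6.
        nodes.add(new Node("dn1", "rack1"));
        nodes.add(new Node("dn2", "rack1"));
        nodes.add(new Node("dn3", "rack2"));
        nodes.add(new Node("dn4", "rack2"));
        nodes.add(new Node("dn5", "rack3"));
        nodes.add(new Node("dn6", "rack4"));
        nodes.add(new Node("dn7", "rack5"));
        nodes.add(new Node("dn8", "rack6"));

        Map<String, Integer> replicasPerNode = new TreeMap<>();
        nodes.forEach(n -> replicasPerNode.put(n.name(), 0));

        Random rnd = new Random(42);
        int blocks = 100_000;

        for (int b = 0; b < blocks; b++) {
            // Writers (e.g. RegionServers) assumed evenly spread over all nodes.
            Node writer = nodes.get(rnd.nextInt(nodes.size()));

            // Replica 1: the local node.
            Node first = writer;

            // Replica 2: a rack-local node if the writer's rack has one,
            // otherwise (assumption) any node on another rack.
            List<Node> rackLocal = nodes.stream()
                    .filter(n -> n.rack().equals(writer.rack()) && n != writer)
                    .toList();
            Node second = rackLocal.isEmpty()
                    ? pickOtherRack(nodes, rnd, writer.rack())
                    : rackLocal.get(rnd.nextInt(rackLocal.size()));

            // Replica 3: a node on a rack not used by the first two replicas.
            Node third = pickOtherRack(nodes, rnd, writer.rack(), second.rack());

            for (Node n : List.of(first, second, third)) {
                replicasPerNode.merge(n.name(), 1, Integer::sum);
            }
        }

        replicasPerNode.forEach((name, count) ->
                System.out.printf("%s: %.2f replicas per written block%n",
                        name, count / (double) blocks));
    }

    /** Pick a random node whose rack is not in the excluded set. */
    private static Node pickOtherRack(List<Node> nodes, Random rnd, String... excludedRacks) {
        List<String> excluded = List.of(excludedRacks);
        List<Node> candidates = nodes.stream()
                .filter(n -> !excluded.contains(n.rack()))
                .toList();
        return candidates.get(rnd.nextInt(candidates.size()));
    }
}
{code}
Under this simplified model the nodes in racks 1 and 2 end up with roughly a third more replicas per written block than the single-node racks, which points in the same direction as the 85% vs 50% skew observed on the real cluster.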
> Option to disable Rack Local Write Preference to avoid 2 issues - 1. Rack-by-Rack Maintenance leaves last data replica at risk, 2. avoid Major Storage Imbalance across DataNodes caused by uneven spread of DataNodes across Racks
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-13739
>                 URL: https://issues.apache.org/jira/browse/HDFS-13739
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: balancer & mover, block placement, datanode, fs, hdfs, hdfs-client, namenode, nn, performance
>    Affects Versions: 2.7.3
>        Environment: Hortonworks HDP 2.6
>            Reporter: Hari Sekhon
>            Priority: Major
>
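One possible shape for the option requested above, noted here only as a sketch and not as a fix available in 2.7.3: newer Hadoop releases allow the NameNode's block placement policy to be swapped via dfs.block.replicator.classname, for example to a rack-spreading policy such as BlockPlacementPolicyRackFaultTolerant, which tries to place replicas on as many different racks as possible instead of preferring a rack-local second replica:
{code:xml}
<!-- hdfs-site.xml on the NameNode - a sketch, assuming a Hadoop release that
     ships BlockPlacementPolicyRackFaultTolerant (not guaranteed on 2.7.x) -->
<property>
  <name>dfs.block.replicator.classname</name>
  <value>org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyRackFaultTolerant</value>
</property>
{code}
A NameNode restart would be needed for the change to take effect, and it would only influence where new replicas are placed, not move existing blocks.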
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org