From: "Hari Sekhon (JIRA)"
To: hdfs-issues@hadoop.apache.org
Date: Tue, 17 Jul 2018 10:53:00 +0000 (UTC)
Subject: [jira] [Updated] (HDFS-13739) Option to disable Rack Local Write Preference to avoid 2 issues - 1. Rack-by-Rack Maintenance leaves last data replica at risk, 2. avoid Major Storage Imbalance across DataNodes caused by uneven spread of DataNodes across Racks

     [ https://issues.apache.org/jira/browse/HDFS-13739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hari Sekhon updated HDFS-13739:
-------------------------------
    Description:

The current HDFS write pattern of "local node, rack-local node, other-rack node" is good for most purposes, but there are at least two scenarios where it is not ideal:
 # Rack-by-Rack Maintenance leaves data at risk of losing the last remaining replica. If a single DataNode then failed, it would likely cause some data outage, or even data loss if the rack is lost or the upgrade fails (perhaps it's a rack rebuild). Setting replication to 4 would reduce write performance and waste storage, and is currently the only workaround to this issue (see the sketch after this list).
 # Major Storage Imbalance across DataNodes when there is an uneven layout of DataNodes across racks - some nodes fill up while others are half empty.
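For reference, the replication-factor workaround mentioned in point 1 looks roughly like this (a sketch only - the path shown is illustrative, not taken from this cluster):
{code}
# Raise the replication factor of existing data to 4 and wait (-w) for
# re-replication to complete; the path is an illustrative example.
hdfs dfs -setrep -w 4 /apps/hbase/data

# For newly written files the cluster-wide default can be raised instead,
# via dfs.replication in hdfs-site.xml - at the cost of roughly a third
# more storage per block compared to replication 3.
{code}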
I have observed this storage imbalance on a cluster where half the nodes were 85% full and the other half were only 50% full.

Rack layouts like the following illustrate the problem - the nodes in the same rack will only choose to send half of their block replicas to each other, so they fill up first, while the other nodes receive far fewer replica blocks:
{code}
NumNodes - Rack
2 - rack 1
2 - rack 2
1 - rack 3
1 - rack 4
1 - rack 5
1 - rack 6{code}
In this case, if I reduce the number of replicas to 2 I get an almost perfect spread of blocks across all DataNodes, because HDFS has no choice but to place the only second replica on a different rack. If I increase replication back to 3, it goes back to 85% full on half the nodes and 50% on the other half, because the extra replicas are written only to rack-local nodes.

Why not just run the HDFS Balancer to fix it, you might say? This is a heavily loaded HBase cluster - aside from destroying HBase's data locality and performance by moving blocks out from underneath the RegionServers, as soon as an HBase major compaction occurs (at least weekly) all blocks get rewritten by HBase, and the HDFS client once again writes to the local node, a rack-local node and an other-rack node, producing the same storage imbalance. Hence this cannot be solved by running the HDFS Balancer on HBase clusters - or for any application sitting on top of HDFS that has any HDFS block churn.
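To make the mechanism above concrete, the following is a rough, self-contained Monte Carlo sketch (an illustration, not code from the HDFS code base) of the write pattern as described in this report - local node, then a rack-local node where one exists, then a node on another rack - applied to the rack layout shown above. The node names, the uniform spread of writers, and the fallback for writers on single-node racks are all simplifying assumptions:
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

/**
 * Rough sketch of the write pattern as described in this report
 * ("local node, rack-local node, other-rack node") for the rack layout above.
 * It is NOT the exact BlockPlacementPolicyDefault logic; it only illustrates
 * why nodes that share a rack accumulate more replicas than nodes that sit
 * alone on their rack.
 */
public class RackImbalanceSketch {

    record Node(String name, String rack) {}

    public static void main(String[] args) {
        List<Node> nodes = new ArrayList<>();
        // Layout from the description: 2 nodes in racks 1-2, 1 node in racks 3-6.
        nodes.add(new Node("dn1", "rack1"));
        nodes.add(new Node("dn2", "rack1"));
        nodes.add(new Node("dn3", "rack2"));
        nodes.add(new Node("dn4", "rack2"));
        nodes.add(new Node("dn5", "rack3"));
        nodes.add(new Node("dn6", "rack4"));
        nodes.add(new Node("dn7", "rack5"));
        nodes.add(new Node("dn8", "rack6"));

        Map<String, Integer> replicasPerNode = new TreeMap<>();
        nodes.forEach(n -> replicasPerNode.put(n.name(), 0));

        Random rnd = new Random(42);
        int blocks = 100_000;

        for (int b = 0; b < blocks; b++) {
            // Writers (e.g. RegionServers) assumed evenly spread over all nodes.
            Node writer = nodes.get(rnd.nextInt(nodes.size()));

            // Replica 1: the local node.
            Node first = writer;

            // Replica 2: a rack-local node if the writer's rack has one,
            // otherwise (assumption) any node on another rack.
            List<Node> rackLocal = nodes.stream()
                    .filter(n -> n.rack().equals(writer.rack()) && n != writer)
                    .toList();
            Node second = rackLocal.isEmpty()
                    ? pickOtherRack(nodes, rnd, writer.rack())
                    : rackLocal.get(rnd.nextInt(rackLocal.size()));

            // Replica 3: a node on a rack not used by the first two replicas.
            Node third = pickOtherRack(nodes, rnd, writer.rack(), second.rack());

            for (Node n : List.of(first, second, third)) {
                replicasPerNode.merge(n.name(), 1, Integer::sum);
            }
        }

        replicasPerNode.forEach((name, count) ->
                System.out.printf("%s: %.2f replicas per written block%n",
                        name, count / (double) blocks));
    }

    /** Pick a random node whose rack is not in the excluded set. */
    private static Node pickOtherRack(List<Node> nodes, Random rnd, String... excludedRacks) {
        List<String> excluded = List.of(excludedRacks);
        List<Node> candidates = nodes.stream()
                .filter(n -> !excluded.contains(n.rack()))
                .toList();
        return candidates.get(rnd.nextInt(candidates.size()));
    }
}
{code}
Under this simplified model the nodes in racks 1 and 2 end up with roughly a third more replicas per written block than the single-node racks, which points in the same direction as the 85% vs 50% skew observed on the real cluster.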
> Option to disable Rack Local Write Preference to avoid 2 issues - 1. Rack-by-Rack Maintenance leaves last data replica at risk, 2. avoid Major Storage Imbalance across DataNodes caused by uneven spread of DataNodes across Racks
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-13739
>                 URL: https://issues.apache.org/jira/browse/HDFS-13739
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: balancer & mover, block placement, datanode, fs, hdfs, hdfs-client, namenode, nn, performance
>    Affects Versions: 2.7.3
>        Environment: Hortonworks HDP 2.6
>            Reporter: Hari Sekhon
>            Priority: Major
>
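One possible shape for the option requested above, noted here only as a sketch and not as a fix available in 2.7.3: newer Hadoop releases allow the NameNode's block placement policy to be swapped via dfs.block.replicator.classname, for example to a rack-spreading policy such as BlockPlacementPolicyRackFaultTolerant, which tries to place replicas on as many different racks as possible instead of preferring a rack-local second replica:
{code:xml}
<!-- hdfs-site.xml on the NameNode - a sketch, assuming a Hadoop release that
     ships BlockPlacementPolicyRackFaultTolerant (not guaranteed on 2.7.x) -->
<property>
  <name>dfs.block.replicator.classname</name>
  <value>org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyRackFaultTolerant</value>
</property>
{code}
A NameNode restart would be needed for the change to take effect, and it would only influence where new replicas are placed, not move existing blocks.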
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org