hadoop-hdfs-issues mailing list archives

From "Hari Sekhon (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-13739) Option to disable Rack Local Write Preference to avoid 2 issues - 1. Rack-by-Rack Maintenance leaves last data replica at risk, 2. avoid Major Storage Imbalance across DataNodes caused by uneven spread of Datanodes across Racks
Date Tue, 17 Jul 2018 10:53:00 GMT

     [ https://issues.apache.org/jira/browse/HDFS-13739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hari Sekhon updated HDFS-13739:
-------------------------------
    Description: 
The current HDFS write pattern of "local node, rack local node, other rack node" is good for most
purposes, but there are at least 2 scenarios where it is not ideal:
 # Rack-by-rack maintenance leaves data at risk of losing its last remaining replica. While a
rack is offline, blocks with two replicas on that rack are down to one copy, so a single
DataNode failure would likely cause a data outage, or even data loss if the rack is lost or
an upgrade fails (perhaps it's a rack rebuild). The only current workaround is to set replicas
to 4, which reduces write performance and wastes storage.
 # Major storage imbalance across DataNodes when DataNodes are spread unevenly across
racks - some nodes fill up while others are half empty.
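
To make the first scenario concrete, here is a minimal sketch that counts how many replicas of a block stay reachable while one rack is offline. It assumes the write pattern quoted above puts two of a block's three replicas on the same rack; the function name and rack labels are hypothetical and exist only for this illustration.

```python
def live_replicas(replica_racks, offline_rack):
    """Count the replicas of one block that are not on the offline rack."""
    return sum(1 for rack in replica_racks if rack != offline_rack)


# Replication 3 (assumed layout): local node and rack-local node share rack1,
# and the third replica sits on another rack.
three = ["rack1", "rack1", "rack2"]

# Replication 4 workaround (assumed layout): the extra replica lands on a third rack.
four = ["rack1", "rack1", "rack2", "rack3"]

print(live_replicas(three, "rack1"))  # 1 replica left during rack1 maintenance
print(live_replicas(four, "rack1"))   # 2 replicas left during rack1 maintenance
```

With replication 3, taking rack1 down leaves a single copy, so one further DataNode failure makes the block unavailable. The fourth replica buys one more copy at the cost of extra writes and storage.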

I have observed this storage imbalance on a cluster where half the nodes were 85% full and
the other half were only 50% full.

Rack layouts like the following illustrate this - the nodes in the same rack will only choose
to send half their block replicas to each other, so they will fill up first, while other nodes
will receive far fewer replica blocks:
{code:java}
NumNodes - Rack 
2 - rack 1
2 - rack 2
1 - rack 3
1 - rack 4 
1 - rack 5
1 - rack 6{code}
In this case, if I reduce the number of replicas to 2, I get an almost perfect spread of
blocks across all DataNodes because HDFS has no choice but to place the 2nd replica
on a different rack. If I increase the replicas back to 3, it goes back to 85% on half the
nodes and 50% on the other half, because the extra replicas go only to rack
local nodes.
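
The effect can be illustrated with a small expected-value model of the rack layout above. This is a simplified sketch and not the real HDFS block placement code. It assumes that every node writes the same number of blocks, that replica 1 stays on the writer, that replica 2 goes to a rack-local peer when replication is 3 and such a peer exists, and that all remaining replicas are spread uniformly over nodes on other racks. The function and variable names are hypothetical.

```python
from fractions import Fraction

# Rack layout from the table above: two racks with 2 nodes, four racks with 1 node.
RACKS = {"rack1": 2, "rack2": 2, "rack3": 1, "rack4": 1, "rack5": 1, "rack6": 1}
NODES = [(rack, i) for rack, count in RACKS.items() for i in range(count)]


def expected_load(replication):
    """Expected replicas stored per node when every node writes one block.

    Simplified model (an assumption, not the real placement policy):
    local replica first, then a rack-local peer if replication >= 3 and a
    peer exists, then the rest uniformly over nodes on other racks.
    """
    load = {node: Fraction(0) for node in NODES}
    for writer in NODES:
        load[writer] += 1  # replica 1: local node
        remaining = replication - 1
        peers = [n for n in NODES if n[0] == writer[0] and n != writer]
        if replication >= 3 and peers:
            for peer in peers:  # replica 2: rack-local node
                load[peer] += Fraction(1, len(peers))
            remaining -= 1
        remote = [n for n in NODES if n[0] != writer[0]]
        for node in remote:  # remaining replicas: uniform over other racks
            load[node] += Fraction(remaining, len(remote))
    return load


for replication in (2, 3):
    load = expected_load(replication)
    paired = float(load[("rack1", 0)])
    single = float(load[("rack3", 0)])
    print(f"replication={replication}: 2-node rack {paired:.2f}, 1-node rack {single:.2f}")
```

Under these assumptions the model gives about 1.90 versus 2.10 replicas per node at replication 2, which is nearly even, and about 3.48 versus 2.52 at replication 3, so nodes in the 2-node racks hold roughly 1.4 times as much. The model only shows the direction of the skew; it does not try to reproduce the observed 85% and 50% figures.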

Why not just run the HDFS balancer to fix it, you might ask? This is a heavily loaded HBase
cluster. Aside from destroying HBase's data locality and performance by moving blocks out
from underneath RegionServers, the balancer's effect does not last: as soon as an HBase major
compaction occurs (at least weekly), all blocks get rewritten by HBase and the HDFS client
again writes to local node, rack local node, other rack node, resulting in the same storage
imbalance. Hence this cannot be solved by running the HDFS balancer on HBase clusters, or for
any application sitting on top of HDFS that has any HDFS block churn.

  was:
The current HDFS write pattern of "local node, rack local node, other rack node" is good for most
purposes, but there are at least 2 scenarios where it is not ideal:
 # Rack-by-rack maintenance leaves data at risk of losing its last remaining replica. While a
rack is offline, blocks with two replicas on that rack are down to one copy, so a single
DataNode failure would likely cause a data outage, or even data loss if the rack is lost or
an upgrade fails (perhaps it's a complete rebuild upgrade). The only current workaround is to
set replicas to 4, which reduces write performance and wastes storage.
 # Major storage imbalance across DataNodes when DataNodes are spread unevenly across
racks - some nodes fill up while others are half empty.

I have observed this storage imbalance on a cluster where half the nodes were 85% full and
the other half were only 50% full.

Rack layouts like the following illustrate this - the nodes in the same rack will only choose
to send half their block replicas to each other, so they will fill up first, while other nodes
will receive far fewer replica blocks:
{code:java}
NumNodes - Rack 
2 - rack 1
2 - rack 2
1 - rack 3
1 - rack 4 
1 - rack 5
1 - rack 6{code}
In this case, if I reduce the number of replicas to 2, I get an almost perfect spread of
blocks across all DataNodes because HDFS has no choice but to place the 2nd replica
on a different rack. If I increase the replicas back to 3, it goes back to 85% on half the
nodes and 50% on the other half, because the extra replicas go only to rack
local nodes.

Why not just run the HDFS balancer to fix it, you might ask? This is a heavily loaded HBase
cluster. Aside from destroying HBase's data locality and performance by moving blocks out
from underneath RegionServers, the balancer's effect does not last: as soon as an HBase major
compaction occurs (at least weekly), all blocks get rewritten by HBase and the HDFS client
again writes to local node, rack local node, other rack node, resulting in the same storage
imbalance. Hence this cannot be solved by running the HDFS balancer on HBase clusters, or for
any application sitting on top of HDFS that has any HDFS block churn.


> Option to disable Rack Local Write Preference to avoid 2 issues - 1. Rack-by-Rack Maintenance
leaves last data replica at risk, 2. avoid Major Storage Imbalance across DataNodes caused
by uneven spread of Datanodes across Racks
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-13739
>                 URL: https://issues.apache.org/jira/browse/HDFS-13739
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: balancer & mover, block placement, datanode, fs, hdfs, hdfs-client,
namenode, nn, performance
>    Affects Versions: 2.7.3
>         Environment: Hortonworks HDP 2.6
>            Reporter: Hari Sekhon
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org

