hadoop-hdfs-issues mailing list archives

From "liuyiyang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-12077) Implement a remaining space based balancer policy
Date Sun, 02 Jul 2017 14:22:00 GMT

     [ https://issues.apache.org/jira/browse/HDFS-12077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liuyiyang updated HDFS-12077:
-----------------------------
    Description: 
Our cluster has DataNodes with 2T disks. As the storage utilization of the cluster grows,
we need to add new DataNodes to increase the capacity of the cluster. To keep the utilization
of every DataNode relatively balanced, we usually run the HDFS balancer every time we add
new DataNodes.
We have been running into an issue with heterogeneous disk capacity when using the HDFS balancer.
In our production cluster, we often have to add new DataNodes with larger disk capacity than the
existing DNs. Since the original balancer is implemented to balance the utilization of every
DataNode, it moves blocks until every DN's utilization is within a given threshold of the
cluster's average utilization.
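As a rough illustration of the utilization-based criterion described above, the following Python
sketch checks whether every DN's utilization is within a threshold of the cluster average. The
DataNode class and the function names are hypothetical and are not taken from the balancer's
source. The real balancer also deals with storage types, block placement and bandwidth limits,
all of which are ignored here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataNode:
    """Hypothetical, simplified view of a DataNode (sizes in terabytes)."""
    name: str
    capacity_tb: float
    used_tb: float

    @property
    def utilization_pct(self) -> float:
        # Utilization is used space as a percentage of this node's capacity.
        return 100.0 * self.used_tb / self.capacity_tb


def cluster_utilization_pct(nodes: List[DataNode]) -> float:
    # Cluster average: total used space over total capacity.
    total_used = sum(n.used_tb for n in nodes)
    total_capacity = sum(n.capacity_tb for n in nodes)
    return 100.0 * total_used / total_capacity


def is_balanced_by_utilization(nodes: List[DataNode], threshold_pct: float) -> bool:
    # Balanced when every node is within threshold_pct percentage points
    # of the cluster-wide average utilization.
    average = cluster_utilization_pct(nodes)
    return all(abs(n.utilization_pct - average) <= threshold_pct for n in nodes)
```

Note that this check only looks at percentages, so it says nothing about how much absolute free
space each DN has left.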
For example, in a cluster with two DataNodes, DN1 and DN2, where DN1 has ten disks of 2T each
and DN2 has ten disks of 10T each, the original balancer may consider the cluster balanced in
the following state:
||DataNode||Total Capacity||Used||Remaining||Utilization||
|DN1|20T|18T|2T|90%|
|DN2|100T|90T|10T|90%|
Each DN has reached 90% utilization, but DN1's capacity to store new blocks is far less than
DN2's. Once DN1 is full, all new blocks will be written to DN2 and more MR tasks will be
scheduled on DN2. As a result, DN2 is overloaded and we cannot make full use of each DN's
I/O capacity. In such a case, we would like the balancer to run based on the remaining space
of every DN. After balancing, the remaining space of every DN would be equal, as in the
following state:
||DataNode||Total Capacity||Used||Remaining||Utilization||
|DN1|20T|14T|6T|70%|
|DN2|100T|94T|6T|94%|
In a cluster where remaining space is balanced across DNs, every DN can accept new blocks
written to the cluster, and every DN's I/O capacity can be utilized when running MR jobs.
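The numbers in the second table can be reproduced with a small calculation. The Python sketch
below is only an assumption about how a remaining-space policy might choose its targets (equal
absolute remaining space on every DN, with the total used space unchanged). The function name
is hypothetical and this is not the proposed implementation.

```python
from typing import Dict


def remaining_space_targets(capacity_tb: Dict[str, float],
                            used_tb: Dict[str, float]) -> Dict[str, float]:
    """Return the target used space per node so that all nodes end up
    with the same absolute remaining space (sizes in terabytes)."""
    if not capacity_tb:
        raise ValueError("at least one DataNode is required")

    # Total free space in the cluster does not change when blocks move.
    total_remaining = sum(capacity_tb.values()) - sum(used_tb.values())
    target_remaining = total_remaining / len(capacity_tb)

    targets = {}
    for name, capacity in capacity_tb.items():
        if target_remaining > capacity:
            # This node is too small to ever hold that much free space.
            raise ValueError(f"{name} cannot reach {target_remaining}T remaining")
        targets[name] = capacity - target_remaining
    return targets


# Values from the two-node example: DN1 is 20T with 18T used,
# DN2 is 100T with 90T used.
targets = remaining_space_targets({"DN1": 20, "DN2": 100},
                                  {"DN1": 18, "DN2": 90})
print(targets)  # {'DN1': 14.0, 'DN2': 94.0}
```

In this example 4T of blocks would move from DN1 to DN2, leaving 6T free on each DN.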



Please let me know what you guys think.  I will attach a patch if necessary.




> Implement a remaining space based balancer policy
> -------------------------------------------------
>
>                 Key: HDFS-12077
>                 URL: https://issues.apache.org/jira/browse/HDFS-12077
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: balancer & mover
>    Affects Versions: 2.6.0
>            Reporter: liuyiyang



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org

