hadoop-user mailing list archives

From Алексей Бабутин <zorlaxpokemon...@gmail.com>
Subject Re: disk used percentage is not symmetric on datanodes (balancer)
Date Fri, 22 Mar 2013 16:05:34 GMT
2013/3/20 Tapas Sarangi <tapas.sarangi@gmail.com>

> Thanks for your reply. Some follow-up questions below:
> On Mar 20, 2013, at 5:35 AM, Алексей Бабутин <zorlaxpokemonych@gmail.com>
> wrote:
> dfs.balance.bandwidthPerSec in hdfs-site.xml. I think the balancer can't
> help you, because it makes all the nodes equal. They can differ only by the
> balancer threshold. The threshold is 10 by default. That means nodes can
> differ by up to 350 TB from each other in a 3.5 PB cluster. With a
> threshold of 1, up to 35 TB, and so on.
> If we use multiple racks, let's assume we have 10 racks now and they are
> equally divided in size (350 TB each). With a default threshold of 10, any
> two nodes on a given rack will have a maximum difference of 35 TB, is this
> correct ? Also, does this mean the difference between any two racks will
> also go down to 35 TB ?

The balancer knows about the topology, but when it calculates balancing it
operates only on nodes, not on racks.
You can see how it works in Balancer.java, in the BalancerDatanode class.

I was wrong about the 350 TB and 35 TB figures. It is calculated this way:

For example:

avgutil = cluster_dfsused / cluster_capacity * 100 = 57.14% of cluster
capacity used.
Then we know each node's utilization (node_dfsused / node_capacity * 100).
The balancer thinks all is good if
avgutil - threshold <= node utilization <= avgutil + threshold.

In the ideal case every node uses avgutil of its capacity, but for a 12 TB
node that is only about 6.9 TB, and for a 72 TB node it is about 41 TB.
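
The utilization check described above can be sketched roughly as follows. This
is only an illustration, not the actual Balancer.java code. The function names
and the two-node usage figures are hypothetical, chosen so that the cluster
average comes out at 57.14%. The rule that a node counts as balanced when its
utilization is within the threshold (in percentage points) of the cluster
average is my reading of the balancer's documented behavior.

```python
def cluster_utilization(nodes):
    """Return cluster-wide DFS utilization in percent.

    nodes: list of (dfs_used_tb, capacity_tb) tuples.
    """
    total_used = sum(used for used, _ in nodes)
    total_capacity = sum(capacity for _, capacity in nodes)
    return 100.0 * total_used / total_capacity


def is_balanced(used_tb, capacity_tb, avg_util, threshold=10.0):
    """True if the node is within `threshold` percentage points of avg_util."""
    node_util = 100.0 * used_tb / capacity_tb
    return abs(node_util - avg_util) <= threshold


def ideal_usage_tb(capacity_tb, avg_util):
    """Space a node would use if it sat exactly at the cluster average."""
    return capacity_tb * avg_util / 100.0


# Hypothetical cluster: a 12 TB node holding 10 TB, a 72 TB node holding 38 TB.
nodes = [(10.0, 12.0), (38.0, 72.0)]
avg = cluster_utilization(nodes)

for used, capacity in nodes:
    print(capacity, is_balanced(used, capacity, avg), ideal_usage_tb(capacity, avg))
```

The point of the sketch is that the threshold is compared against percentages,
not terabytes, so the same 10-point window means very different absolute
amounts of data on a 12 TB node and on a 72 TB node.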

The balancer can't help you.

Show me http://namenode.rambler.ru:50070/dfsnodelist.jsp?whatNodes=LIVE if
you can.

> In the ideal case with replication factor 2 and two nodes of 12 TB and
> 72 TB, you will be able to store only 12 TB of replicated data.
> Yes, this is true for exactly two nodes in the cluster with 12 TB and 72
> TB, but not true for more than two nodes in the cluster.
> The best way, in my opinion, is to use multiple racks. Nodes within a rack
> must have identical capacity. Racks must have identical capacity.
> For example:
> rack1: 1 node with 72 TB
> rack2: 6 nodes with 12 TB
> rack3: 3 nodes with 24 TB
> It helps with balancing, because a duplicated block must be on another rack.
> The same question I asked earlier in this message: do multiple racks
> with the default threshold for the balancer minimize the difference between
> racks?
> Why did you select HDFS? Maybe Lustre, CephFS, or something else is a
> better choice.
> It wasn't my decision, and I probably can't change it now. I am new to
> this cluster and trying to understand a few issues. I will explore the
> other options you mentioned.
