From: Tapas Sarangi <tapas.sarangi@gmail.com>
Subject: Re: disk used percentage is not symmetric on datanodes (balancer)
Date: Sun, 24 Mar 2013 20:25:11 -0500
To: user@hadoop.apache.org

Thanks. Does this need a restart of Hadoop on the nodes where this modification is made?
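For reference, the modification in question is the dfs.datanode.du.reserved setting named below; a minimal sketch of how it might look in hdfs-site.xml on the smaller datanodes (the value is illustrative only, not a recommendation):

    <property>
      <name>dfs.datanode.du.reserved</name>
      <!-- non-DFS space to reserve per volume, in bytes (here roughly 50 GB) -->
      <value>53687091200</value>
    </property>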
-----

On Mar 24, 2013, at 8:06 PM, Jamal B wrote:

> dfs.datanode.du.reserved
>
> You could tweak that param on the smaller nodes to "force" the flow of blocks to other nodes. A short-term hack at best, but it should help the situation a bit.
>
> On Mar 24, 2013 7:09 PM, "Tapas Sarangi" wrote:
>
> On Mar 24, 2013, at 4:34 PM, Jamal B wrote:
>
>> It shouldn't cause further problems since most of your small nodes are already at their capacity. You could set or increase the dfs reserved property on your smaller nodes to force the flow of blocks onto the larger nodes.
>
> Thanks. Can you please specify which dfs properties we can set or modify to force the flow of blocks towards the larger nodes rather than the smaller nodes?
>
> -----
>
>> On Mar 24, 2013 4:45 PM, "Tapas Sarangi" wrote:
>> Hi,
>>
>> Thanks for the idea, I will give this a try and report back.
>>
>> My worry is: if we decommission a small node (one at a time), will it move the data to larger nodes or choke other smaller nodes? In principle it should distribute the blocks; the point is that it is not distributing the way we expect it to, so do you think this may cause further problems?
>>
>> ---------
>>
>> On Mar 24, 2013, at 3:37 PM, Jamal B wrote:
>>
>>> Then I think the only way around this would be to decommission the smaller nodes, 1 at a time, and ensure that the blocks are moved to the larger nodes.
>>> And once complete, bring back in the smaller nodes, but maybe only after you tweak the rack topology to match your disk layout more than your network layout, to compensate for the unbalanced nodes.
>>>
>>> Just my 2 cents
>>>
>>> On Sun, Mar 24, 2013 at 4:31 PM, Tapas Sarangi wrote:
>>> Thanks. We have a 1-1 configuration of drives and folders in all the datanodes.
>>>
>>> -Tapas
>>>
>>> On Mar 24, 2013, at 3:29 PM, Jamal B wrote:
>>>
>>>> On both types of nodes, what is your dfs.data.dir set to? Does it specify multiple folders on the same set of drives, or is it 1-1 between folder and drive? If it's set to multiple folders on the same drives, it is probably multiplying the amount of "available capacity" incorrectly, in that it assumes a 1-1 relationship between folder and the total capacity of the drive.
>>>>
>>>> On Sun, Mar 24, 2013 at 3:01 PM, Tapas Sarangi wrote:
>>>> Yes, thanks for pointing that out, but I already know that it is completing the balancing when it exits; otherwise it shouldn't exit.
>>>> Your answer doesn't solve the problem I mentioned earlier in my message. 'hdfs' is stalling and Hadoop is not writing unless space is cleared up from the cluster, even though "df" shows the cluster has about 500 TB of free space.
>>>>
>>>> -------
>>>>
>>>> On Mar 24, 2013, at 1:54 PM, Balaji Narayanan (பாலாஜி நாராயணன்) wrote:
>>>>
>>>>> -setBalancerBandwidth <bandwidth in bytes per second>
>>>>>
>>>>> So the value is bytes per second. If it is running and exiting, it means it has completed the balancing.
>>>>>
>>>>> On 24 March 2013 11:32, Tapas Sarangi wrote:
>>>>> Yes, we are running the balancer, though a balancer process runs for almost a day or more before exiting and starting over.
>>>>> The current dfs.balance.bandwidthPerSec value is set to 2x10^9. I assume that's bytes, so about 2 gigabytes/sec. Shouldn't that be reasonable? If it is in bits then we have a problem.
>>>>> What's the unit for "dfs.balance.bandwidthPerSec"?
>>>>>
>>>>> -----
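For concreteness, a rough sketch of how the balancer bandwidth and threshold can be adjusted from the command line (assuming Hadoop 1.x-style commands; the values are illustrative, not recommendations):

    # push a new per-datanode balancer bandwidth to all datanodes
    # (value is in bytes per second; 10485760 = 10 MB/s, illustrative only)
    hadoop dfsadmin -setBalancerBandwidth 10485760

    # re-run the balancer with a tighter threshold than the default 10
    # (threshold is a percentage of disk capacity)
    hadoop balancer -threshold 5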
>>>>> On Mar 24, 2013, at 1:23 PM, Balaji Narayanan (பாலாஜி நாராயணன்) wrote:
>>>>>
>>>>>> Are you running the balancer? If the balancer is running and it is slow, try increasing the balancer bandwidth.
>>>>>>
>>>>>> On 24 March 2013 09:21, Tapas Sarangi wrote:
>>>>>> Thanks for the follow-up. I don't know whether an attachment will pass through this mailing list, but I am attaching a PDF that contains the usage of all live nodes.
>>>>>>
>>>>>> All nodes starting with the letter "g" are the ones with smaller storage space, whereas nodes starting with the letter "s" have larger storage space. As you will see, most of the "gXX" nodes are completely full whereas the "sXX" nodes have a lot of unused space.
>>>>>>
>>>>>> Recently we have been facing a crisis frequently, as 'hdfs' goes into a mode where it is not able to write any further even though the total space available in the cluster is about 500 TB. We believe this has something to do with the way it is balancing the nodes, but we don't understand the problem yet. Maybe the attached PDF will help some of you (experts) to see what is going wrong here...
>>>>>>
>>>>>> Thanks
>>>>>> ------
>>>>>>
>>>>>>> The balancer knows about topology, but when it calculates balancing it operates only with nodes, not with racks.
>>>>>>> You can see how it works in Balancer.java, in BalancerDatanode, around line 509.
>>>>>>>
>>>>>>> I was wrong about 350 TB / 35 TB; it calculates it this way:
>>>>>>>
>>>>>>> For example:
>>>>>>> cluster_capacity = 3.5 PB
>>>>>>> cluster_dfsused = 2 PB
>>>>>>>
>>>>>>> avgutil = cluster_dfsused / cluster_capacity * 100 = 57.14% of cluster capacity used
>>>>>>> Then we know the average node utilization (node_dfsused / node_capacity * 100). The balancer thinks all is good if avgutil + 10 > node_utilization >= avgutil - 10.
>>>>>>>
>>>>>>> In the ideal case every node uses avgutil of its capacity, but for a 12 TB node that is only about 6.5 TB and for a 72 TB node about 40 TB.
>>>>>>>
>>>>>>> The balancer can't help you.
>>>>>>>
>>>>>>> Show me http://namenode.rambler.ru:50070/dfsnodelist.jsp?whatNodes=LIVE if you can.
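Spelling out the arithmetic quoted above, using the same example numbers and the balancer's default threshold of 10:

    avgutil = cluster_dfsused / cluster_capacity * 100
            = 2 PB / 3.5 PB * 100
            ≈ 57.14 %

    a node is treated as balanced while
        avgutil - 10 <= node_utilization < avgutil + 10
        i.e. 47.14 % <= node_utilization < 67.14 %

    At 57.14 % utilization a 12 TB node holds about 6.9 TB and a 72 TB node
    about 41 TB. The balancer stops moving blocks off a 12 TB node once it
    falls below about 8.1 TB used (67.14 % of 12 TB), even though in absolute
    terms it has far less free space than the larger nodes.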
>>>>>>>> In the ideal case with replication factor 2, with two nodes of 12 TB and 72 TB you will be able to have only 12 TB of replicated data.
>>>>>>>
>>>>>>> Yes, this is true for exactly two nodes in the cluster with 12 TB and 72 TB, but not true for more than two nodes in the cluster.
>>>>>>>
>>>>>>>> The best way, in my opinion, is to use multiple racks. Nodes in a rack must have identical capacity, and racks must have identical capacity.
>>>>>>>> For example:
>>>>>>>>
>>>>>>>> rack1: 1 node with 72 TB
>>>>>>>> rack2: 6 nodes with 12 TB
>>>>>>>> rack3: 3 nodes with 24 TB
>>>>>>>>
>>>>>>>> It helps with balancing, because the duplicated block must be on another rack.
>>>>>>>
>>>>>>> The same question I asked earlier in this message: do multiple racks with the default threshold for the balancer minimize the difference between racks?
>>>>>>>
>>>>>>>> Why did you select HDFS? Maybe Lustre, CephFS or something else is a better choice.
>>>>>>>
>>>>>>> It wasn't my decision, and I probably can't change it now. I am new to this cluster and trying to understand a few issues. I will explore other options as you mentioned.
>>>>>>>
>>>>>>> --
>>>>>>> http://balajin.net/blog
>>>>>>> http://flic.kr/balajijegan
>>>>>
>>>>> --
>>>>> http://balajin.net/blog
>>>>> http://flic.kr/balajijegan