From: Tapas Sarangi <tapas.sarangi@gmail.com>
To: user@hadoop.apache.org
Subject: Re: disk used percentage is not symmetric on datanodes (balancer)
Date: Sun, 24 Mar 2013 15:48:57 -0500

On Mar 24, 2013, at 3:40 PM, Alexey Babutin <zorlaxpokemonych@gmail.com> wrote:

> You said that threshold=10. Run the command manually: hadoop balancer -threshold 9.5, then 9, and so on with a 0.5 step.

We are not setting a threshold anywhere in our configuration, so we are using the default, which I believe is 10.
Why do you suggest such steps need to be tested for the balancer? Please explain. We had a discussion earlier on this thread and came to the conclusion that the threshold will not help in this situation.

-----

> On Sun, Mar 24, 2013 at 11:01 PM, Tapas Sarangi <tapas.sarangi@gmail.com> wrote:
> Yes, thanks for pointing that out, but I already know that it is completing the balancing when it exits; otherwise it shouldn't exit.
> Your answer doesn't solve the problem I mentioned earlier in my message:
> 'hdfs' is stalling and hadoop is not writing unless space is cleared up from the cluster, even though "df" shows the cluster has about 500 TB of free space.
>
> -------
>
> On Mar 24, 2013, at 1:54 PM, Balaji Narayanan (பாலாஜி நாராயணன்) <balaji@balajin.net> wrote:
>
>> -setBalancerBandwidth <bandwidth in bytes per second>
>>
>> So the value is bytes per second. If it is running and exiting, it means it has completed the balancing.
>>
>> On 24 March 2013 11:32, Tapas Sarangi <tapas.sarangi@gmail.com> wrote:
>> Yes, we are running the balancer, though a balancer process runs for almost a day or more before exiting and starting over.
>> The current dfs.balance.bandwidthPerSec value is set to 2x10^9. I assume that's bytes, so about 2 gigabytes/sec. Shouldn't that be reasonable? If it is in bits, then we have a problem.
>> What's the unit for "dfs.balance.bandwidthPerSec"?
>>
>> -----
>>
>> On Mar 24, 2013, at 1:23 PM, Balaji Narayanan (பாலாஜி நாராயணன்) <lists@balajin.net> wrote:
>>
>>> Are you running the balancer? If the balancer is running and it is slow, try increasing the balancer bandwidth.
>>>
>>> On 24 March 2013 09:21, Tapas Sarangi <tapas.sarangi@gmail.com> wrote:
>>> Thanks for the follow-up. I don't know whether an attachment will pass through this mailing list, but I am attaching a PDF that contains the usage of all live nodes.
>>>
>>> All nodes starting with the letter "g" are the ones with smaller storage space, whereas nodes starting with the letter "s" have larger storage space. As you will see, most of the "gXX" nodes are completely full, whereas the "sXX" nodes have a lot of unused space.
>>>
>>> Recently, we have been facing a crisis frequently: 'hdfs' goes into a mode where it is not able to write any further, even though the total space available in the cluster is about 500 TB.
>>> We believe this has something to do with the way it is balancing the nodes, but we don't understand the problem yet. Maybe the attached PDF will help some of you (experts) see what is going wrong here...
>>>
>>> Thanks
>>> ------
>>>
>>>> The balancer knows about topology, but when it calculates balancing it operates only on nodes, not on racks.
>>>> You can see how it works in Balancer.java, in BalancerDatanode, around line 509.
>>>>
>>>> I was wrong about 350 TB / 35 TB; it calculates it this way:
>>>>
>>>> For example:
>>>> cluster_capacity = 3.5 PB
>>>> cluster_dfsused = 2 PB
>>>>
>>>> avgutil = cluster_dfsused / cluster_capacity * 100 = 57.14% used cluster capacity
>>>> Then we know the average node utilization (node_dfsused / node_capacity * 100). The balancer thinks all is good if avgutil + 10 > node_utilization >= avgutil - 10.
>>>>
>>>> The ideal case is that every node uses avgutil of its capacity, but for a 12 TB node that is only about 6.5 TB, and for a 72 TB node it is about 40 TB.
>>>>
>>>> The balancer can't help you.
>>>>
>>>> Show me http://namenode.rambler.ru:50070/dfsnodelist.jsp?whatNodes=LIVE if you can.
>>>>
>>>>> In the ideal case with replication factor 2, with two nodes of 12 TB and 72 TB you will be able to hold only 12 TB of replicated data.
>>>>
>>>> Yes, this is true for exactly two nodes in the cluster with 12 TB and 72 TB, but not true for more than two nodes in the cluster.
>>>>
>>>>> The best way, in my opinion, is to use multiple racks. Nodes in a rack must have identical capacity, and racks must have identical capacity.
>>>>> For example:
>>>>>
>>>>> rack1: 1 node with 72 TB
>>>>> rack2: 6 nodes with 12 TB
>>>>> rack3: 3 nodes with 24 TB
>>>>>
>>>>> It helps with balancing, because a duplicated block must go to another rack.
>>>>>
>>>>
>>>> The same question I asked earlier in this message: does multiple racks with the default threshold for the balancer minimize the difference between racks?
>>>>
>>>>> Why did you select hdfs? Maybe lustre, cephfs, or something else is a better choice.
>>>>
>>>> It wasn't my decision, and I probably can't change it now. I am new to this cluster and am trying to understand a few issues. I will explore other options as you mentioned.
>>>>
>>>> --
>>>> http://balajin.net/blog
>>>> http://flic.kr/balajijegan
>>
>> --
>> http://balajin.net/blog
>> http://flic.kr/balajijegan
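For reference, Alexey's stepping suggestion at the top of the thread amounts to re-running the balancer with a progressively tighter threshold. A dry-run sketch (it only echoes the commands; on a real cluster with `hadoop` on the PATH you would drop the `echo`):

```shell
# Print the balancer invocations with the threshold stepped down
# from 9.5 in 0.5 increments (dry run: echo only, nothing is executed).
for t in 9.5 9.0 8.5 8.0; do
  echo "hadoop balancer -threshold $t"
done
```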