From: Tapas Sarangi <tapas.sarangi@gmail.com>
To: user@hadoop.apache.org
Subject: Re: disk used percentage is not symmetric on datanodes (balancer)
Date: Sun, 24 Mar 2013 13:32:28 -0500

Yes, we are running the balancer, though a balancer process runs for almost a day or more before exiting and starting over.

The current dfs.balance.bandwidthPerSec value is set to 2x10^9. I assume that's bytes, so about 2 gigabytes/sec. Shouldn't that be reasonable? If it is in bits then we have a problem. What's the unit for "dfs.balance.bandwidthPerSec"?

-----
On Mar 24, 2013, at 1:23 PM, Balaji Narayanan (பாலாஜி நாராயணன்) <lists@balajin.net> wrote:

> Are you running the balancer? If the balancer is running and it is slow, try increasing the balancer bandwidth.
>
> On 24 March 2013 09:21, Tapas Sarangi <tapas.sarangi@gmail.com> wrote:
> Thanks for the follow up.
> I don't know whether the attachment will pass through this mailing list, but I am attaching a PDF that contains the usage of all live nodes.
>
> All nodes starting with the letter "g" are the ones with smaller storage space, whereas nodes starting with the letter "s" have larger storage space. As you will see, most of the "gXX" nodes are completely full, whereas the "sXX" nodes have a lot of unused space.
>
> Recently we have been facing a crisis frequently: 'hdfs' goes into a mode where it is not able to write anything further, even though the total space available in the cluster is about 500 TB. We believe this has something to do with the way it is balancing the nodes, but we don't understand the problem yet. Maybe the attached PDF will help some of you (experts) see what is going wrong here...
>
> Thanks
> ------
>
>> The balancer knows about topology, but when it calculates balancing it operates only with nodes, not with racks.
>> You can see how it works in Balancer.java, in BalancerDatanode, around line 509.
>>
>> I was wrong about 350 TB / 35 TB; it calculates it this way:
>>
>> For example:
>> cluster_capacity = 3.5 PB
>> cluster_dfsused = 2 PB
>>
>> avgutil = cluster_dfsused / cluster_capacity * 100 = 57.14% of cluster capacity used
>> Then we know the average node utilization (node_dfsused / node_capacity * 100). The balancer thinks everything is fine if avgutil + 10 > node_utilization >= avgutil - 10.
>>
>> In the ideal case every node would use avgutil of its capacity, but for a 12 TB node that is only about 6.9 TB, and for a 72 TB node it is about 41 TB.
>>
>> The balancer can't help you.
>>
>> Show me http://namenode.rambler.ru:50070/dfsnodelist.jsp?whatNodes=LIVE if you can.
>>
>>> In the ideal case, with replication factor 2 and two nodes of 12 TB and 72 TB, you will be able to have only 12 TB of replicated data.
>>
>> Yes, this is true for exactly two nodes in the cluster with 12 TB and 72 TB, but not true for more than two nodes in the cluster.
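The avgutil threshold rule quoted above (a node counts as balanced when its utilization lies within plus or minus the threshold of the cluster average) can be sketched with the numbers from this thread. This is a hypothetical illustration only, not the actual code from Balancer.java; the class and method names are made up:

```java
// Hypothetical sketch of the utilization-threshold rule described above.
// Not the real Balancer.java code; names and structure are invented for
// illustration. Sizes are in TB, threshold in percentage points.
public class BalancerThresholdSketch {

    // Cluster average utilization in percent: dfsused / capacity * 100.
    static double avgUtil(double dfsUsedTb, double capacityTb) {
        return dfsUsedTb / capacityTb * 100.0;
    }

    // A node is considered balanced when
    //   avgutil + threshold > node_utilization >= avgutil - threshold
    static boolean isBalanced(double nodeUsedTb, double nodeCapTb,
                              double avgUtil, double threshold) {
        double nodeUtil = nodeUsedTb / nodeCapTb * 100.0;
        return nodeUtil < avgUtil + threshold
            && nodeUtil >= avgUtil - threshold;
    }

    public static void main(String[] args) {
        // 2 PB used out of 3.5 PB capacity, as in the example above.
        double avg = avgUtil(2000.0, 3500.0);   // about 57.14%
        System.out.printf("avgutil = %.2f%%%n", avg);

        // A 12 TB node near the cluster average (~6.9 TB used) is
        // balanced, as is a 72 TB node at ~41 TB used.
        System.out.println(isBalanced(6.9, 12.0, avg, 10.0));
        System.out.println(isBalanced(41.0, 72.0, avg, 10.0));

        // A completely full 12 TB node is far outside the +/-10 band.
        System.out.println(isBalanced(12.0, 12.0, avg, 10.0));
    }
}
```

Note how, with the default threshold of 10, the check is entirely in percentages: a 12 TB node and a 72 TB node can both be "balanced" while differing by tens of terabytes in absolute free space, which matches the behavior described in this thread.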
>>
>>> In my opinion, the best way is to use multiple racks. Nodes in a rack must have identical capacity, and racks must have identical capacity.
>>> For example:
>>>
>>> rack1: 1 node with 72 TB
>>> rack2: 6 nodes with 12 TB
>>> rack3: 3 nodes with 24 TB
>>>
>>> It helps with balancing, because a duplicated block must be on another rack.
>>
>> The same question I asked earlier in this message: with multiple racks, does the default threshold for the balancer minimize the difference between racks?
>>
>>> Why did you select hdfs? Maybe lustre, cephfs, or something else is a better choice.
>>
>> It wasn't my decision, and I probably can't change it now. I am new to this cluster and am trying to understand a few issues. I will explore the other options you mentioned.
>>
>> --
>> http://balajin.net/blog
>> http://flic.kr/balajijegan