Subject: Re: disk used percentage is not symmetric on datanodes (balancer)
From: Jamal B <jm15119b@gmail.com>
To: user@hadoop.apache.org
Date: Sun, 24 Mar 2013 17:34:14 -0400

It shouldn't cause further problems, since most of your small nodes are already at their capacity. You could set or increase the dfs reserved property on your smaller nodes to force the flow of blocks onto the larger nodes.
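For reference, the property being suggested here is most likely dfs.datanode.du.reserved, the per-volume space a datanode keeps free for non-DFS use; raising it on the small nodes makes them advertise less remaining capacity. A minimal hdfs-site.xml sketch (the value is only an example, and the datanodes would need a restart to pick it up):

    <!-- hdfs-site.xml on the smaller "gXX" datanodes -->
    <property>
      <name>dfs.datanode.du.reserved</name>
      <value>1000000000000</value>   <!-- ~1 TB per volume reserved for non-DFS use, in bytes -->
    </property>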

On Mar 24, 2013 4:45 PM, "Tapas Sarangi" <tapas.sarangi@gmail.com> wrote:
Hi,

Thanks for the idea, I will give this a try and report back.

My worry is that if we decommission a small node (one at a time), will it move the data to the larger nodes or choke the other smaller nodes? In principle it should distribute the blocks; the point is that it is not distributing them the way we expect it to, so do you think this may cause further problems?

---------


Then I think the only way around this would be to decommission the smaller nodes, one at a time, and ensure that the blocks are moved to the larger nodes.
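For the record, decommissioning is normally driven by the exclude file; a rough sketch of the per-node loop (the path and hostname are made up, and it assumes dfs.hosts.exclude in hdfs-site.xml already points at the exclude file):

    # add one small node at a time to the exclude file the namenode knows about
    echo "g01.example.com" >> /etc/hadoop/conf/dfs.exclude
    hadoop dfsadmin -refreshNodes      # 'hdfs dfsadmin -refreshNodes' on newer releases
    # wait until the node shows as "Decommissioned" in the namenode web UI
    # before moving on to the next one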
And once complete, bring the smaller nodes back in, but maybe only after you tweak the rack topology to match your disk layout more than your network layout, to compensate for the unbalanced nodes.
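Rack awareness is usually wired up with a small script that prints a "rack" for every datanode address it is given; a minimal sketch of the kind of thing being suggested (the rack names are made up, and the property that points at the script is topology.script.file.name on Hadoop 1.x, net.topology.script.file.name on 2.x):

    #!/bin/sh
    # print one rack per host argument, grouping by disk class instead of physical rack
    for host in "$@"; do
      case "$host" in
        s*) echo /big-storage-rack ;;     # larger-storage nodes
        g*) echo /small-storage-rack ;;   # smaller-storage nodes
        *)  echo /default-rack ;;
      esac
    done

In practice the namenode may hand the script IP addresses rather than hostnames, so a lookup table keyed by IP is the more common form.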

Just my 2 cents

On Sun, Mar 24, 2013 at 4:31 PM, Tapas Sarangi <tapas.sarangi@gmail.com> wrote:
Thanks. We have a 1-1 configuration of drives and folders on all the datanodes.
-Tapas

On Mar 24, 2013, at 3:29 PM, Jamal B <jm15119b@gmail.com> wrote:

On both types of nodes, what is your dfs.data.dir set to? Does it specify multiple folders on the same set of drives, or is it 1-1 between folder and drive? If it's set to multiple folders on the same drives, it is probably multiplying the amount of "available capacity" incorrectly, in that it assumes a 1-1 relationship between folder and the total capacity of the drive.
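Just to make the two layouts concrete, a 1-1 configuration is simply one dfs.data.dir entry per physical drive; a sketch with made-up mount points:

    <!-- hdfs-site.xml: one directory per drive, i.e. 1-1 between folder and disk -->
    <property>
      <name>dfs.data.dir</name>
      <value>/data/disk1/dfs/data,/data/disk2/dfs/data,/data/disk3/dfs/data</value>
    </property>

Several entries that live on the same mount point would be the multiple-folders-per-drive case described above.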


On Sun, Mar 24, 2013 at 3:01 PM, Tapas Sarangi <tapas.sarangi@gmail.com> wrote:
Yes, thanks for pointing that out, but I already know that it is completing the balancing when exiting; otherwise it shouldn't exit.
Your answer doesn't solve the problem I mentioned earlier in my message: 'hdfs' is stalling and hadoop is not writing unless space is cleared up from the cluster, even though "df" shows the cluster has about 500 TB of free space.

-------

On Mar 24, 2013, at 1:54 PM, Balaji Narayanan (பாலாஜி நாராயணன்) <balaji@balajin.net> wrote:

 -setBalancerBandwidth <bandwidth in bytes per second>

So the value is bytes per second. If it is running and exiting, it means it has completed the balancing.
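To spell that out, the cap can be set persistently in hdfs-site.xml or pushed to a running cluster with dfsadmin; a sketch (the 100 MB/s figure is only an example, and -setBalancerBandwidth may not exist on very old releases):

    <!-- hdfs-site.xml: per-datanode limit on balancing traffic, in bytes per second -->
    <property>
      <name>dfs.balance.bandwidthPerSec</name>
      <value>104857600</value>   <!-- 100 MB/s -->
    </property>

    # or adjust it on the running datanodes without a restart
    hadoop dfsadmin -setBalancerBandwidth 104857600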


On 24 March 2013 11:32, Tapas Sarangi <tapas.sarangi@gmail.com> wrote:
Yes, we are running the balancer, though a balancer process runs for almost a day or more before exiting and starting over.
The current dfs.balance.bandwidthPerSec value is set to 2x10^9. I assume that's bytes, so about 2 gigabytes/sec. Shouldn't that be reasonable? If it is in bits then we have a problem.
What's the unit for "dfs.balance.bandwidthPerSec"?

-----

On Mar 24, 2013, at 1:23 PM, Balaji Narayanan (பாலாஜி நாராயணன்) <lists@balajin.net> wrote:

Are you running the balancer? If the balancer is running and it is slow, try increasing the balancer bandwidth.


On 24 March 2013 09:21, Tapas Sarangi <tapas.sarangi@gmail.com> wrote:
Thanks for the follow-up. I don't know whether an attachment will pass through this mailing list, but I am attaching a PDF that contains the usage of all live nodes.

All nodes starting with the letter "g" are the ones with smaller storage space, whereas nodes starting with the letter "s" have larger storage space. As you will see, most of the "gXX" nodes are completely full whereas the "sXX" nodes have a lot of unused space.

Recently, we are facing a crisis frequently: 'hdfs' goes into a mode where it is not able to write any further, even though the total space available in the cluster is about 500 TB. We believe this has something to do with the way it is balancing the nodes, but we don't understand the problem yet. Maybe the attached PDF will help some of you (experts) to see what is going wrong here...

Thanks
------







The Balancer knows about topology, but when it calculates balancing it operates only with nodes, not with racks.
You can see how it works in Balancer.java, in BalancerDatanode, around line 509.

I was wrong about the 350Tb/35Tb figure; it calculates it in the following way.

For example:
cluster_capacity = 3.5Pb
cluster_dfsused = 2Pb

avgutil = cluster_dfsused / cluster_capacity * 100 = 57.14% of cluster capacity used
Then we know the average node utilization (node_dfsused / node_capacity * 100). The Balancer thinks everything is fine if avgutil + 10 > node_utilization >= avgutil - 10.

The ideal case is that every node uses avgutil of its capacity, but for a 12Tb node that is only about 7Tb, and for a 72Tb node it is about 41Tb.
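Spelling out that arithmetic with the numbers above (threshold taken as the default 10):

    avgutil = 2Pb / 3.5Pb * 100 ~= 57.1%
    12Tb node: 57.1% of 12Tb ~= 6.9Tb   (accepted band 47.1%-67.1%, i.e. roughly 5.7Tb-8.1Tb)
    72Tb node: 57.1% of 72Tb ~= 41.1Tb  (accepted band 47.1%-67.1%, i.e. roughly 33.9Tb-48.3Tb)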

The Balancer can't help you.

Show me http://namenode.rambler.ru:50070/dfsnodelist.jsp?whatNodes=LIVE if you can.



In the ideal case with replication factor 2, with two nodes of 12Tb and 72Tb, you will be able to have only 12Tb of replicated data.

Yes, this is true for exactly two nodes in the cluster with 12 TB and 72 TB, but not true for more than two nodes in the cluster.


The best way, in my opinion, is to use multiple racks. Nodes in a rack must have identical capacity, and the racks must have identical capacity.
For example:

rack1: 1 node with 72Tb
rack2: 6 nodes with 12Tb
rack3: 3 nodes with 24Tb

It helps with balancing, because the duplicated block must be on another rack.


The same question I asked earlier in this message: do multiple racks with the default threshold for the balancer minimize the difference between racks?

Why did you select hdfs? Maybe lustre, cephfs, or something else is a better choice.

It wasn't my decision, and I probably can't change it now. I am new to this cluster and trying to understand a few issues. I will explore other options as you mentioned.

--
http://balajin.net/blog
http://flic.kr/balajijegan




--
http://balajin.net/blog
http://flic.kr/balajijegan




