Subject: Re: disk used percentage is not symmetric on datanodes (balancer)
From: Алексей Бабутин <zorlaxpokemonych@gmail.com>
To: user@hadoop.apache.org
Date: Fri, 22 Mar 2013 20:05:34 +0400

2013/3/20 Tapas Sarangi <tapas.sarangi@gmail.com>
> Thanks for your reply. Some follow-up questions below:

> On Mar 20, 2013, at 5:35 AM, Алексей Бабутин <zorlaxpokemonych@gmail.com> wrote:

>> dfs.balance.bandwidthPerSec in hdfs-site.xml. I think the balancer can't help you, because it makes all the nodes equal. They can differ only by the balancer threshold, which is 10 by default. That means nodes can differ by up to 350 TB from each other in a 3.5 PB cluster; with a threshold of 1, by up to 35 TB, and so on.

> If we use multiple racks, let's assume we have 10 racks now and they are equally divided in size (350 TB each). With a default threshold of 10, any two nodes on a given rack will have a maximum difference of 35 TB, is this correct? Also, does this mean the difference between any two racks will also go down to 35 TB?

The balancer knows about the topology, but when it calculates balancing it operates only on nodes, not on racks. You can see how it works in Balancer.java, in BalancerDatanode, around line 509.

I was wrong about the 350 TB / 35 TB figures; it actually calculates things this way:

For example:
cluster_capacity = 3.5 PB
cluster_dfsused = 2 PB

avgutil = cluster_dfsused / cluster_capacity * 100 = 57.14% used cluster capacity
Then we know each node's utilization (node_dfsused / node_capacity * 100). The balancer considers everything fine if avgutil + 10 > node_utilization >= avgutil - 10.

In the ideal case every node would use avgutil of its capacity, but that is only about 6.9 TB on a 12 TB node and about 41 TB on a 72 TB node.

The balancer can't help you.
Show me http://namenode.rambler.ru:50070/dfsnodelist.jsp?whatNodes=LIVE if you can.
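To make the arithmetic concrete, here is a rough sketch of that check. It is a simplification of what Balancer.java really does, and the class name and node figures below are only illustrative:

// Simplified sketch of the balancer's threshold check; not the real
// Balancer.java code, just the arithmetic described above.
public class BalancerThresholdSketch {

    // A node is "balanced enough" when its utilization lies within
    // +/- threshold percentage points of the average cluster utilization.
    static boolean withinThreshold(double usedTb, double capacityTb,
                                   double avgUtil, double threshold) {
        double nodeUtil = usedTb / capacityTb * 100.0;
        return nodeUtil < avgUtil + threshold && nodeUtil >= avgUtil - threshold;
    }

    public static void main(String[] args) {
        double clusterCapacityTb = 3500.0;  // 3.5 PB
        double clusterUsedTb = 2000.0;      // 2 PB
        double avgUtil = clusterUsedTb / clusterCapacityTb * 100.0;  // ~57.14%
        double threshold = 10.0;            // balancer default

        // Both nodes are "balanced" at ~57% utilization, yet they hold very
        // different absolute amounts of data (~6.9 TB vs ~41 TB).
        System.out.printf("avg util = %.2f%%%n", avgUtil);
        System.out.printf("12 TB node with 6.9 TB used: %b%n",
                withinThreshold(6.9, 12.0, avgUtil, threshold));
        System.out.printf("72 TB node with 41.0 TB used: %b%n",
                withinThreshold(41.0, 72.0, avgUtil, threshold));
    }
}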



>> In the ideal case with replication factor 2, with two nodes of 12 TB and 72 TB you will be able to store only 12 TB of replicated data.

> Yes, this is true for exactly two nodes in the cluster with 12 TB and 72 TB, but not true for more than two nodes in the cluster.
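That point can be checked with a bit of arithmetic: ignoring rack placement rules, the data you can keep with two replicas is capped both by half of the total capacity and by the capacity outside the largest node. A small sketch, where the second node list is just an illustration and not your cluster:

// Upper bound on how much data fits with replication factor 2, assuming
// the two replicas of a block must land on different nodes and ignoring
// rack placement rules. Node sizes are illustrative.
public class Rf2CapacitySketch {

    static double maxReplicatedDataTb(double[] nodeCapacitiesTb) {
        double total = 0.0, largest = 0.0;
        for (double c : nodeCapacitiesTb) {
            total += c;
            largest = Math.max(largest, c);
        }
        // Every block stores at most one replica on the largest node, and
        // two replicas together consume twice the logical data size.
        return Math.min(total / 2.0, total - largest);
    }

    public static void main(String[] args) {
        // Two nodes of 12 TB and 72 TB: only 12 TB of replicated data fits.
        System.out.println(maxReplicatedDataTb(new double[] {12, 72}));
        // With six 12 TB nodes plus the 72 TB node, the limit grows to 72 TB.
        System.out.println(maxReplicatedDataTb(new double[] {12, 12, 12, 12, 12, 12, 72}));
    }
}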


>> The best way, in my opinion, is to use multiple racks. Nodes in a rack must have identical capacity, and the racks must have identical total capacity.
>> For example:

>> rack1: 1 node with 72 TB
>> rack2: 6 nodes with 12 TB
>> rack3: 3 nodes with 24 TB

>> It helps with balancing, because a duplicated block must go to another rack (each rack above totals 72 TB).

> The same question I asked earlier in this message: does using multiple racks with the default balancer threshold minimize the difference between racks?

>> Why did you select HDFS? Maybe Lustre, CephFS or something else is a better choice.

> It wasn't my decision, and I probably can't change it now. I am new to this cluster and trying to understand a few issues. I will explore other options as you mentioned.

