From: Tapas Sarangi <tapas.sarangi@gmail.com>
To: user@hadoop.apache.org
Subject: Re: disk used percentage is not symmetric on datanodes (balancer)
Date: Sun, 24 Mar 2013 15:48:57 -0500

On Mar 24, 2013, at 3:40 PM, Alexey Babutin <zorlaxpokemonych@gmail.com> wrote:

> You said that threshold=10. Run the command manually: hadoop balancer -threshold 9.5, then 9, and so on with a 0.5 step.

We are not setting a threshold anywhere in our configuration, so we are using the default, which I believe is 10.
Why do you suggest such steps need to be tested for the balancer? Please explain. We had a discussion earlier on this thread and came to the conclusion that the threshold will not help in this situation.

-----

> On Sun, Mar 24, 2013 at 11:01 PM, Tapas Sarangi <tapas.sarangi@gmail.com> wrote:
> Yes, thanks for pointing that out, but I already know that it is completing the balancing when it exits; otherwise it shouldn't exit.
> Your answer doesn't solve the problem I mentioned earlier in my message:
> 'hdfs' is stalling and hadoop is not writing unless space is cleared up from the cluster, even though "df" shows the cluster has about 500 TB of free space.
>
> -------
>
> On Mar 24, 2013, at 1:54 PM, Balaji Narayanan (பாலாஜி நாராயணன்) <balaji@balajin.net> wrote:
>
>> -setBalancerBandwidth <bandwidth in bytes per second>
>>
>> So the value is bytes per second. If it is running and exiting, it means it has completed the balancing.
>>
>> On 24 March 2013 11:32, Tapas Sarangi <tapas.sarangi@gmail.com> wrote:
>> Yes, we are running the balancer, though a balancer process runs for almost a day or more before exiting and starting over.
>> The current dfs.balance.bandwidthPerSec value is set to 2x10^9. I assume that's bytes, so about 2 gigabytes/sec. Shouldn't that be reasonable? If it is in bits, then we have a problem.
>> What's the unit for "dfs.balance.bandwidthPerSec"?
>>
>> -----
>>
>> On Mar 24, 2013, at 1:23 PM, Balaji Narayanan (பாலாஜி நாராயணன்) <lists@balajin.net> wrote:
>>
>>> Are you running the balancer? If the balancer is running and it is slow, try increasing the balancer bandwidth.
>>>
>>> On 24 March 2013 09:21, Tapas Sarangi <tapas.sarangi@gmail.com> wrote:
>>> Thanks for the follow-up. I don't know whether an attachment will pass through this mailing list, but I am attaching a PDF that contains the usage of all live nodes.
>>>
>>> All nodes starting with the letter "g" are the ones with smaller storage space, whereas nodes starting with the letter "s" have larger storage space. As you will see, most of the "gXX" nodes are completely full, whereas the "sXX" nodes have a lot of unused space.
>>>
>>> Recently, we have been facing a crisis frequently: 'hdfs' goes into a mode where it is not able to write any further, even though the total space available in the cluster is about 500 TB.
>>> We believe this has something to do with the way it is balancing the nodes, but we don't understand the problem yet. Maybe the attached PDF will help some of you (experts) see what is going wrong here...
>>>
>>> Thanks
>>> ------
>>>
>>>> The balancer knows about topology, but when it calculates balancing it operates only on nodes, not on racks.
>>>> You can see how it works in Balancer.java, in BalancerDatanode, around line 509.
>>>>
>>>> I was wrong about 350 TB / 35 TB; it calculates it this way:
>>>>
>>>> For example:
>>>> cluster_capacity = 3.5 PB
>>>> cluster_dfsused = 2 PB
>>>>
>>>> avgutil = cluster_dfsused / cluster_capacity * 100 = 57.14% used cluster capacity
>>>> Then we know the average node utilization (node_dfsused / node_capacity * 100). The balancer thinks all is good if avgutil + 10 > node_utilization >= avgutil - 10.
>>>>
>>>> The ideal case is that every node uses avgutil of its capacity, but for a 12 TB node that is only about 6.5 TB, and for a 72 TB node it is about 40 TB.
>>>>
>>>> The balancer can't help you.
>>>>
>>>> Show me http://namenode.rambler.ru:50070/dfsnodelist.jsp?whatNodes=LIVE if you can.
>>>>
>>>>> In the ideal case with replication factor 2, with two nodes of 12 TB and 72 TB you will be able to hold only 12 TB of replicated data.
>>>>
>>>> Yes, this is true for exactly two nodes in the cluster with 12 TB and 72 TB, but not true for more than two nodes in the cluster.
>>>>
>>>>> The best way, in my opinion, is to use multiple racks. Nodes in a rack must have identical capacity, and racks must have identical capacity.
>>>>> For example:
>>>>>
>>>>> rack1: 1 node with 72 TB
>>>>> rack2: 6 nodes with 12 TB
>>>>> rack3: 3 nodes with 24 TB
>>>>>
>>>>> It helps with balancing, because a duplicated block must go to another rack.
>>>>>
>>>>
>>>> The same question I asked earlier in this message: does multiple racks with the default threshold for the balancer minimize the difference between racks?
>>>>
>>>>> Why did you select hdfs? Maybe lustre, cephfs, or something else is a better choice.
>>>>
>>>> It wasn't my decision, and I probably can't change it now. I am new to this cluster and am trying to understand a few issues. I will explore other options as you mentioned.
>>>>
>>>> --
>>>> http://balajin.net/blog
>>>> http://flic.kr/balajijegan
>>
>> --
>> http://balajin.net/blog
>> http://flic.kr/balajijegan
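For reference, Alexey's stepping suggestion at the top of the thread amounts to re-running the balancer with a progressively tighter threshold. A dry-run sketch (it only echoes the commands; on a real cluster with `hadoop` on the PATH you would drop the `echo`):

```shell
# Print the balancer invocations with the threshold stepped down
# from 9.5 in 0.5 increments (dry run: echo only, nothing is executed).
for t in 9.5 9.0 8.5 8.0; do
  echo "hadoop balancer -threshold $t"
done
```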