Subject: Re: Hang when add/remove a datanode into/from a 2 datanode cluster
From: sam liu <samliuhadoop@gmail.com>
To: user@hadoop.apache.org
Date: Wed, 31 Jul 2013 14:39:17 +0800
I opened a JIRA to track this issue: https://issues.apache.org/jira/browse/HDFS-5046


2013/7/2 sam liu <samliuhadoop@gmail.com>
Yes, the default replication factor is 3. However, my case is strange: while the decommission hangs, I found that some blocks' expected replica count is 3, even though the 'dfs.replication' value in hdfs-site.xml on every cluster node has been 2 since the cluster was set up. Below are my steps:

1. Install a Hadoop 1.1.1 cluster, with 2 datanodes: dn1 and dn2. And, in hdfs-site.xml, set the 'dfs.replication' to 2
2. Add node dn3 into the cluster as a new datanode, without changing the 'dfs.replication' value in hdfs-site.xml (it stays at 2)
note: step 2 passed
3. Decommission dn3 from the cluster
Expected result: dn3 is decommissioned successfully
Actual result:
a). The decommission progress hangs and the status stays at 'Waiting DataNode status: Decommissioned'. But if I execute 'hadoop dfs -setrep -R 2 /', the decommission resumes and eventually completes (see the sketch below).
b). However, if the initial cluster includes >= 3 datanodes, this issue is not encountered when adding/removing another datanode. For example, if I set up a cluster with 3 datanodes, I can successfully add a 4th datanode to it, and then successfully remove that 4th datanode from the cluster.
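
For reference, here is roughly how I confirm the over-replicated blocks and
drive the workaround (a sketch; the grep patterns assume the Hadoop 1.1.1
output format, which may differ slightly in other versions):

    # list blocks whose expected replication is still 3
    hadoop fsck / -files -blocks | grep 'repl=3'

    # force every existing file back to replication factor 2
    hadoop dfs -setrep -R 2 /

    # watch the decommission status of each datanode
    hadoop dfsadmin -report | grep 'Decommission Status'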

I suspect it's a bug and plan to open a JIRA against Hadoop HDFS for this. Any comments?

Thanks!


2013/6/21 Harsh J <harsh@cloudera.com>
The dfs.replication setting is a per-file parameter. If you have a client that
does not use the supplied configs, then its default replication is 3,
and all files it creates (as part of the app or via a job config)
will be created with replication factor 3.
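
For example, to pin the factor for a single shell upload no matter which
configs the client picked up, something like this should work (FsShell
accepts the generic -D option, though I have not re-verified this on 1.1.1;
the file names are just placeholders):

    hadoop fs -D dfs.replication=2 -put localfile /user/sam/localfile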

You can do an -lsr to find all files and filter which ones have been
created with a factor of 3 (versus expected config of 2).
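
For instance, something along these lines (the replication factor is the
second column of the -lsr listing; a rough sketch, untested here):

    hadoop fs -lsr / | awk '$2 == 3'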

On Fri, Jun 21, 2013 at 3:13 PM, sam liu <samliuhadoop@gmail.com> wrote:
> Hi George,
>
> Actually, in my hdfs-site.xml, I always set 'dfs.replication' to 2. But still
> encounter this issue.
>
> Thanks!
>
>
> 2013/6/21 George Kousiouris <gkousiou@mail.ntua.gr>
>>
>>
>> Hi,
>>
>> I think I have faced this before: the problem is that you have the rep
>> factor = 3, so it seems to hang because it needs 3 nodes to achieve that
>> factor (replicas are not created on the same node). If you set the
>> replication factor = 2, I think you will not have this issue. So in general
>> you must make sure that the rep factor is <= the number of available
>> datanodes.
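>>
>> For reference, the hdfs-site.xml entry in question looks like this (a
>> standard snippet, not copied from any particular cluster):
>>
>>   <property>
>>     <name>dfs.replication</name>
>>     <value>2</value>
>>   </property>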
>> BR,
>> George
>>
>>
>> On 6/21/2013 12:29 PM, sam liu wrote:
>>
>> Hi,
>>
>> I encountered an issue which hangs the decommission operation. Its steps:
>> 1. Install a Hadoop 1.1.1 cluster, with 2 datanodes: dn1 and dn2. And, in
>> hdfs-site.xml, set the 'dfs.replication' to 2
>> 2. Add node dn3 into the cluster as a new datanode, without changing the
>> 'dfs.replication' value in hdfs-site.xml (keeping it as 2)
>> note: step 2 passed
>> 3. Decommission dn3 from the cluster
>>
>> Expected result: dn3 could be decommissioned successfully
>>
>> Actual result: the decommission progress hangs and the status is always
>> 'Waiting DataNode status: Decommissioned'
>>
>> However, if the initial cluster includes >= 3 datanodes, this issue won't
>> be encountered when add/remove another datanode.
>>
>> Also, after step 2, I noticed that some blocks' expected replicas is 3,
>> but the 'dfs.replication' value in hdfs-site.xml is always 2!
>>
>> Could anyone please help provide some triage?
>>
>> Thanks in advance!
>>
>>
>>
>> --
>> ---------------------------
>>
>> George Kousiouris, PhD
>> Electrical and Computer Engineer
>> Division of Communications,
>> Electronics and Information Engineering
>> School of Electrical and Computer Engineering
>> Tel: +30 210 772 2546
>> Mobile: +30 6939354121
>> Fax: +30 210 772 2569
>> Email: gkousiou@mail.ntua.gr
>> Site: http://users.ntua.gr/gkousiou/
>>
>> National Technical University of Athens
>> 9 Heroon Polytechniou str., 157 73 Zografou, Athens, Greece
>
>



--
Harsh J

