Subject: Re: Hang when add/remove a datanode into/from a 2 datanode cluster
From: sam liu <samliuhadoop@gmail.com>
To: user@hadoop.apache.org
Date: Tue, 2 Jul 2013 14:04:56 +0800

Yes, the default replication factor is 3. However, in my case it's strange: while the decommission hangs, I found that some blocks' expected replica count is 3, even though the 'dfs.replication' value in hdfs-site.xml on every cluster node has been 2 since the cluster was set up. Here are my steps:

1. Install a Hadoop 1.1.1 cluster with 2 datanodes, dn1 and dn2, and set 'dfs.replication' to 2 in hdfs-site.xml
2. Add node dn3 into the cluster as a new datanode, leaving the 'dfs.replication' value in hdfs-site.xml at 2
   (note: step 2 passed)
3. Decommission dn3 from the cluster

Expected result: dn3 is decommissioned successfully

Actual result:
a) The decommission progress hangs and the status stays at 'Waiting DataNode status: Decommissioned'. But if I execute 'hadoop dfs -setrep -R 2 /', the decommission continues and eventually completes.
b) However, if the initial cluster includes >= 3 datanodes, this issue is not encountered when adding/removing another datanode. For example, if I set up a cluster with 3 datanodes, I can successfully add a 4th datanode to it, and can then successfully remove that 4th datanode from the cluster.

I suspect it's a bug and plan to open a JIRA against Hadoop HDFS for this. Any comments?

Thanks!

2013/6/21 Harsh J <harsh@cloudera.com>
> The dfs.replication setting is a per-file parameter.
> If you have a client that does not use the supplied configs, then its
> default replication is 3, and all files it creates (as part of the app
> or via a job config) will be created with replication factor 3.
>
> You can do an -lsr to find all files and filter which ones have been
> created with a factor of 3 (versus the expected config of 2).
>
> On Fri, Jun 21, 2013 at 3:13 PM, sam liu <samliuhadoop@gmail.com> wrote:
> > Hi George,
> >
> > Actually, in my hdfs-site.xml, I always set 'dfs.replication' to 2, but I
> > still encounter this issue.
> >
> > Thanks!
> >
> > 2013/6/21 George Kousiouris <gkousiou@mail.ntua.gr>
> >>
> >> Hi,
> >>
> >> I think I have faced this before. The problem is that you have the rep
> >> factor=3, so it seems to hang because it needs 3 nodes to achieve that
> >> factor (replicas are not created on the same node). If you set the
> >> replication factor=2, I think you will not have this issue. So in general
> >> you must make sure that the rep factor is <= the number of available
> >> datanodes.
> >>
> >> BR,
> >> George
> >>
> >> On 6/21/2013 12:29 PM, sam liu wrote:
> >>
> >> Hi,
> >>
> >> I encountered an issue which hangs the decommission operation. Steps:
> >> 1. Install a Hadoop 1.1.1 cluster with 2 datanodes, dn1 and dn2, and
> >> set 'dfs.replication' to 2 in hdfs-site.xml
> >> 2. Add node dn3 into the cluster as a new datanode, leaving the
> >> 'dfs.replication' value in hdfs-site.xml at 2
> >> (note: step 2 passed)
> >> 3. Decommission dn3 from the cluster
> >>
> >> Expected result: dn3 is decommissioned successfully
> >>
> >> Actual result: the decommission progress hangs and the status stays at
> >> 'Waiting DataNode status: Decommissioned'
> >>
> >> However, if the initial cluster includes >= 3 datanodes, this issue is
> >> not encountered when adding/removing another datanode.
> >>
> >> Also, after step 2, I noticed that some blocks' expected replica count
> >> is 3, but the 'dfs.replication' value in hdfs-site.xml is always 2!
> >>
> >> Could anyone please help provide some triage?
> >>
> >> Thanks in advance!
> >>
> >> --
> >> ---------------------------
> >> George Kousiouris, PhD
> >> Electrical and Computer Engineer
> >> Division of Communications,
> >> Electronics and Information Engineering
> >> School of Electrical and Computer Engineering
> >> Tel: +30 210 772 2546
> >> Mobile: +30 6939354121
> >> Fax: +30 210 772 2569
> >> Email: gkousiou@mail.ntua.gr
> >> Site: http://users.ntua.gr/gkousiou/
> >>
> >> National Technical University of Athens
> >> 9 Heroon Polytechniou str., 157 73 Zografou, Athens, Greece
>
> --
> Harsh J
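Harsh's -lsr suggestion above can be sketched as a small shell pipeline: in 'hadoop fs -lsr' output, the second column is the per-file replication factor ('-' for directories), so filtering on it reveals which files were written with factor 3. The sample listing below is hypothetical (no live cluster is assumed here); on a real cluster you would pipe the output of the actual command instead.

```shell
# Hypothetical sample of 'hadoop fs -lsr /' output; paths and sizes are
# made up for illustration. Column 2 is the replication factor.
lsr_output='-rw-r--r--   3 hdfs supergroup       1024 2013-06-21 12:00 /data/a.txt
drwxr-xr-x   - hdfs supergroup          0 2013-06-21 12:00 /data/dir
-rw-r--r--   2 hdfs supergroup       2048 2013-06-21 12:01 /data/b.txt'

# Print only the paths of files whose replication factor is 3.
# Directories are skipped automatically because their column 2 is '-'.
echo "$lsr_output" | awk '$2 == 3 { print $NF }'
# → /data/a.txt

# On a real cluster, such files can then be forced down to factor 2 with
# the workaround from the thread:
#   hadoop dfs -setrep -R 2 /
```

This matches the observed behavior in the thread: setrep reduces the expected replica count to the number of live datanodes, which lets the stuck decommission complete.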