Subject: Re: Hang when add/remove a datanode into/from a 2 datanode cluster
From: sam liu <samliuhadoop@gmail.com>
To: user@hadoop.apache.org
Date: Wed, 31 Jul 2013 14:39:17 +0800
I opened a JIRA to track this issue: https://issues.apache.org/jira/browse/HDFS-5046


2013/7/2 sam liu <samliuhadoop@gmail.com>
Yes, the default replication factor is 3. However, my case is strange: while the decommission hangs, I found that some blocks' expected replica count is 3, even though the 'dfs.replication' value in hdfs-site.xml on every cluster node has been 2 since the cluster was set up. Below are my steps:

1. Install a Hadoop 1.1.1 cluster, with 2 datanodes: dn1 and dn2. And, in hdfs-site.xml, set the 'dfs.replication' to 2
2. Add node dn3 into the cluster as a new datanode, without changing the 'dfs.replication' value in hdfs-site.xml (it stays at 2)
note: step 2 passed
3. Decommission dn3 from the cluster
Expected result: dn3 is decommissioned successfully
Actual result:
a). The decommission progress hangs and the status stays at 'Waiting DataNode status: Decommissioned'. But if I execute 'hadoop dfs -setrep -R 2 /', the decommission resumes and eventually completes (see the sketch below).
b). However, if the initial cluster includes >= 3 datanodes, this issue is not encountered when adding/removing another datanode. For example, if I set up a cluster with 3 datanodes, I can successfully add a 4th datanode to it, and then successfully remove that 4th datanode from the cluster.
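
For reference, here is roughly how I confirm the over-replicated blocks and
drive the workaround (a sketch; the grep patterns assume the Hadoop 1.1.1
output format, which may differ slightly in other versions):

    # list blocks whose expected replication is still 3
    hadoop fsck / -files -blocks | grep 'repl=3'

    # force every existing file back to replication factor 2
    hadoop dfs -setrep -R 2 /

    # watch the decommission status of each datanode
    hadoop dfsadmin -report | grep 'Decommission Status'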

I suspect it's a bug and plan to open a JIRA against Hadoop HDFS for this. Any comments?

Thanks!


2013/6/21 Harsh J <harsh@cloudera.com>
The dfs.replication setting is a per-file parameter. If you have a client that
does not use the supplied configs, then its default replication is 3,
and all files it creates (as part of the app or via a job config)
will be created with replication factor 3.
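
For example, to pin the factor for a single shell upload no matter which
configs the client picked up, something like this should work (FsShell
accepts the generic -D option, though I have not re-verified this on 1.1.1;
the file names are just placeholders):

    hadoop fs -D dfs.replication=2 -put localfile /user/sam/localfile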

You can do an -lsr to find all files and filter which ones have been
created with a factor of 3 (versus expected config of 2).
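
For instance, something along these lines (the replication factor is the
second column of the -lsr listing; a rough sketch, untested here):

    hadoop fs -lsr / | awk '$2 == 3'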

On Fri, Jun 21, 2013 at 3:13 PM, sam liu <samliuhadoop@gmail.com> wrote:
> Hi George,
>
> Actually, in my hdfs-site.xml, I always set 'dfs.replication' to 2. But still
> encounter this issue.
>
> Thanks!
>
>
> 2013/6/21 George Kousiouris <gkousiou@mail.ntua.gr>
>>
>>
>> Hi,
>>
>> I think I have faced this before: the problem is that you have the rep
>> factor = 3, so it seems to hang because it needs 3 nodes to achieve that
>> factor (replicas are not created on the same node). If you set the
>> replication factor = 2, I think you will not have this issue. So in general
>> you must make sure that the rep factor is <= the number of available
>> datanodes.
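>>
>> For reference, the hdfs-site.xml entry in question looks like this (a
>> standard snippet, not copied from any particular cluster):
>>
>>   <property>
>>     <name>dfs.replication</name>
>>     <value>2</value>
>>   </property>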
>> BR,
>> George
>>
>>
>> On 6/21/2013 12:29 PM, sam liu wrote:
>>
>> Hi,
>>
>> I encountered an issue which hangs the decommission operation. Its steps:
>> 1. Install a Hadoop 1.1.1 cluster, with 2 datanodes: dn1 and dn2. And, in
>> hdfs-site.xml, set the 'dfs.replication' to 2
>> 2. Add node dn3 into the cluster as a new datanode, without changing the
>> 'dfs.replication' value in hdfs-site.xml (keeping it as 2)
>> note: step 2 passed
>> 3. Decommission dn3 from the cluster
>>
>> Expected result: dn3 could be decommissioned successfully
>>
>> Actual result: the decommission progress hangs and the status is always
>> 'Waiting DataNode status: Decommissioned'
>>
>> However, if the initial cluster includes >= 3 datanodes, this issue won't
>> be encountered when add/remove another datanode.
>>
>> Also, after step 2, I noticed that some blocks' expected replicas is 3,
>> but the 'dfs.replication' value in hdfs-site.xml is always 2!
>>
>> Could anyone please help provide some triage?
>>
>> Thanks in advance!
>>
>>
>>
>> --
>> ---------------------------
>>
>> George Kousiouris, PhD
>> Electrical and Computer Engineer
>> Division of Communications,
>> Electronics and Information Engineering
>> School of Electrical and Computer Engineering
>> Tel: +30 210 772 2546
>> Mobile: +30 6939354121
>> Fax: +30 210 772 2569
>> Email: gkousiou@mail.ntua.gr
>> Site: http://users.ntua.gr/gkousiou/
>>
>> National Technical University of Athens
>> 9 Heroon Polytechniou str., 157 73 Zografou, Athens, Greece
>
>



--
Harsh J

