From: Harsh J <harsh@cloudera.com>
To: user@hadoop.apache.org
Date: Thu, 1 Aug 2013 08:41:41 +0530
Subject: Re: Hang when add/remove a datanode into/from a 2 datanode cluster
References: <51C41E24.7030108@mail.ntua.gr>

As I said before, dfs.replication is a per-file property, and the config
can be bypassed by clients that do not read the configs, place a manual
API override, etc. If you really want to define a hard maximum and catch
such clients, try setting dfs.replication.max to 2 at your NameNode.

On Thu, Aug 1, 2013 at 8:07 AM, sam liu wrote:
> But, please note that the value of 'dfs.replication' in the cluster is
> always 2, even when the datanode count is 3. And I am pretty sure I did
> not manually create any files with repl=3. So, why were some HDFS files
> created with repl=3 rather than repl=2?
>
>
> 2013/8/1 Harsh J
>>
>> The step (a) points to both your problem and its solution. You have
>> files being created with repl=3 on a 2-DN cluster, which will prevent
>> decommission. This is not a bug.
>>
>> On Wed, Jul 31, 2013 at 12:09 PM, sam liu wrote:
>> > I opened a jira for tracking this issue:
>> > https://issues.apache.org/jira/browse/HDFS-5046
>> >
>> >
>> > 2013/7/2 sam liu
>> >>
>> >> Yes, the default replication factor is 3. However, in my case, it's
>> >> strange: while the decommission hangs, I found that some blocks'
>> >> expected replica count is 3, although the 'dfs.replication' value in
>> >> hdfs-site.xml on every cluster node has been 2 since the beginning
>> >> of the cluster setup. Below are my steps:
>> >>
>> >> 1. Install a Hadoop 1.1.1 cluster with 2 datanodes, dn1 and dn2,
>> >> and in hdfs-site.xml set 'dfs.replication' to 2
>> >> 2. Add node dn3 into the cluster as a new datanode, without changing
>> >> the 'dfs.replication' value in hdfs-site.xml (keep it as 2)
>> >> note: step 2 passed
>> >> 3. Decommission dn3 from the cluster
>> >> Expected result: dn3 is decommissioned successfully
>> >> Actual result:
>> >> a). The decommission progress hangs and the status is always
>> >> 'Waiting DataNode status: Decommissioned'. But if I execute
>> >> 'hadoop dfs -setrep -R 2 /', the decommission continues and finally
>> >> completes.
>> >> b). However, if the initial cluster includes >= 3 datanodes, this
>> >> issue is not encountered when adding/removing another datanode. For
>> >> example, if I set up a cluster with 3 datanodes, I can successfully
>> >> add a 4th datanode to it, and then also successfully remove the 4th
>> >> datanode from the cluster.
>> >>
>> >> I suspect it's a bug and plan to open a jira against Hadoop HDFS for
>> >> this. Any comments?
>> >>
>> >> Thanks!
>> >>
>> >>
>> >> 2013/6/21 Harsh J
>> >>>
>> >>> The dfs.replication setting is a per-file parameter. If you have a
>> >>> client that does not use the supplied configs, then its default
>> >>> replication is 3, and all files it creates (as part of the app or
>> >>> via a job config) will have replication factor 3.
>> >>>
>> >>> You can do an -lsr to find all files and filter which ones have
>> >>> been created with a factor of 3 (versus the expected config of 2).
>> >>>
>> >>> On Fri, Jun 21, 2013 at 3:13 PM, sam liu wrote:
>> >>> > Hi George,
>> >>> >
>> >>> > Actually, in my hdfs-site.xml, I always set 'dfs.replication' to
>> >>> > 2, but I still encounter this issue.
>> >>> >
>> >>> > Thanks!
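The -lsr audit suggested above can be sketched as a small shell filter. The listing below is a hypothetical sample of `hadoop fs -lsr /` output (in that format the second column is the per-file replication factor, and directories show '-'); only the awk filter itself would be pointed at real output.

```shell
# Hypothetical sample of 'hadoop fs -lsr /' output: permissions, repl
# factor, owner, group, size, date, time, path.
lsr_sample='-rw-r--r--   3 sam supergroup  1048576 2013-06-21 10:02 /data/part-00000
-rw-r--r--   2 sam supergroup  2097152 2013-06-21 10:03 /data/part-00001
drwxr-xr-x   - sam supergroup        0 2013-06-21 10:01 /data'

# Keep only regular files whose replication factor exceeds 2,
# printing the factor and the path.
echo "$lsr_sample" | awk '$2 != "-" && $2 + 0 > 2 {print $2, $NF}'
# prints: 3 /data/part-00000
```

Running the same filter over real `hadoop fs -lsr /` output should surface exactly the files that need their factor lowered (e.g. via the `hadoop dfs -setrep -R 2 /` command quoted in this thread) before decommission can finish on a 2-datanode cluster.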
>> >>> >
>> >>> >
>> >>> > 2013/6/21 George Kousiouris
>> >>> >>
>> >>> >>
>> >>> >> Hi,
>> >>> >>
>> >>> >> I think I have faced this before. The problem is that you have
>> >>> >> the rep factor = 3, so it seems to hang because it needs 3 nodes
>> >>> >> to achieve the factor (replicas are not created on the same
>> >>> >> node). If you set the replication factor = 2, I think you will
>> >>> >> not have this issue. So in general you must make sure that the
>> >>> >> rep factor is <= the number of available datanodes.
>> >>> >>
>> >>> >> BR,
>> >>> >> George
>> >>> >>
>> >>> >>
>> >>> >> On 6/21/2013 12:29 PM, sam liu wrote:
>> >>> >>
>> >>> >> Hi,
>> >>> >>
>> >>> >> I encountered an issue which hangs the decommission operation.
>> >>> >> Its steps:
>> >>> >> 1. Install a Hadoop 1.1.1 cluster with 2 datanodes, dn1 and dn2,
>> >>> >> and in hdfs-site.xml set the 'dfs.replication' to 2
>> >>> >> 2. Add node dn3 into the cluster as a new datanode, without
>> >>> >> changing the 'dfs.replication' value in hdfs-site.xml (keep it
>> >>> >> as 2)
>> >>> >> note: step 2 passed
>> >>> >> 3. Decommission dn3 from the cluster
>> >>> >>
>> >>> >> Expected result: dn3 is decommissioned successfully
>> >>> >>
>> >>> >> Actual result: the decommission progress hangs and the status is
>> >>> >> always 'Waiting DataNode status: Decommissioned'
>> >>> >>
>> >>> >> However, if the initial cluster includes >= 3 datanodes, this
>> >>> >> issue is not encountered when adding/removing another datanode.
>> >>> >>
>> >>> >> Also, after step 2, I noticed that some blocks' expected replica
>> >>> >> count is 3, but the 'dfs.replication' value in hdfs-site.xml is
>> >>> >> always 2!
>> >>> >>
>> >>> >> Could anyone please help provide some triage?
>> >>> >>
>> >>> >> Thanks in advance!
>> >>> >>
>> >>> >>
>> >>> >> --
>> >>> >> ---------------------------
>> >>> >>
>> >>> >> George Kousiouris, PhD
>> >>> >> Electrical and Computer Engineer
>> >>> >> Division of Communications,
>> >>> >> Electronics and Information Engineering
>> >>> >> School of Electrical and Computer Engineering
>> >>> >> Tel: +30 210 772 2546
>> >>> >> Mobile: +30 6939354121
>> >>> >> Fax: +30 210 772 2569
>> >>> >> Email: gkousiou@mail.ntua.gr
>> >>> >> Site: http://users.ntua.gr/gkousiou/
>> >>> >>
>> >>> >> National Technical University of Athens
>> >>> >> 9 Heroon Polytechniou str., 157 73 Zografou, Athens, Greece
>> >>> >
>> >>> >
>> >>>
>> >>>
>> >>> --
>> >>> Harsh J
>> >>
>> >>
>> >
>>
>>
>> --
>> Harsh J
>
>

--
Harsh J
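For reference, the hard cap mentioned at the top of this thread would be a NameNode-side setting in hdfs-site.xml. A sketch of the fragment (the property name dfs.replication.max is as given in the thread; the surrounding layout is the standard Hadoop configuration format, and a NameNode restart is assumed to pick it up):

```xml
<!-- hdfs-site.xml on the NameNode: reject client requests for more than
     2 replicas, so misconfigured clients fail fast instead of creating
     files that a 2-datanode cluster can never fully replicate. -->
<property>
  <name>dfs.replication.max</name>
  <value>2</value>
</property>
```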