From: Felix GV
Date: Thu, 28 Mar 2013 16:00:30 -0400
Subject: Why do some blocks refuse to replicate...?
To: user@hadoop.apache.org

Hello,

I've been running a virtualized CDH 4.2 cluster. I now want to migrate all my data to another (this time physical) set of slaves and then stop using the virtualized slaves.

I added the new physical slaves to the cluster, and marked all the old virtualized slaves as decommissioned using the dfs.hosts.exclude setting in hdfs-site.xml.
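In case it matters, my decommissioning setup looks roughly like this (the exclude file path below is just a placeholder for wherever yours lives):

    <!-- hdfs-site.xml -->
    <property>
      <name>dfs.hosts.exclude</name>
      <value>/etc/hadoop/conf/dfs.exclude</value>
    </property>

    # dfs.exclude lists one old slave hostname per line; after editing it:
    hdfs dfsadmin -refreshNodes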
Almost all of the data replicated successfully to the new slaves, but when I bring down the old slaves, some blocks start showing up as missing or corrupt (according to the NN UI as well as fsck*). If I restart the old slaves, then there are no missing blocks reported by fsck.

I've tried shutting down the old slaves two by two, and for some of them I saw no problem, but then at some point I found two slaves which, when shut down, resulted in a couple of blocks being under-replicated (1 out of 3 replicas found). For example, fsck would report something like this:

    /user/hive/warehouse/ads_destinations_hosts/part-m-00012:  Under replicated BP-1207449144-10.10.10.21-1356639087818:blk_6150201737015349469_121244. Target Replicas is 3 but found 1 replica(s).

The system then stayed in that state, apparently forever. It never actually fixed the fact that some blocks were under-replicated. Does that mean there's something wrong with some of the old datanodes...? Why do they keep blocks for themselves (even though they're decommissioned) instead of replicating those blocks to the new (non-decommissioned) datanodes?

How do I force replication of under-replicated blocks?
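For what it's worth, the only workaround I can think of is inspecting the affected file with fsck and then manually bumping its replication factor to nudge the NN into re-replicating, roughly like this (I haven't confirmed this actually clears the problem):

    # show where each block's replicas actually live
    hdfs fsck /user/hive/warehouse/ads_destinations_hosts -files -blocks -locations

    # temporarily raise replication to trigger copies onto the new nodes,
    # then set it back to the normal factor
    hadoop fs -setrep -w 4 /user/hive/warehouse/ads_destinations_hosts/part-m-00012
    hadoop fs -setrep 3 /user/hive/warehouse/ads_destinations_hosts/part-m-00012

But I'd rather understand why decommissioning isn't taking care of this by itself.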
*Actually, the NN UI and fsck report slightly different things. The NN UI always seems to report 60 under-replicated blocks, whereas fsck only reports those 60 under-replicated blocks when I shut down some of the old datanodes... When the old nodes are up, fsck reports 0 under-replicated blocks... This is very confusing!

Any help would be appreciated! Please don't hesitate to ask if I should provide some of my logs, settings, or the output of some commands...!

Thanks :)!

--
Felix