From: Felix GV
Date: Thu, 28 Mar 2013 16:00:30 -0400
Subject: Why do some blocks refuse to replicate...?
To: user@hadoop.apache.org

Hello,

I've been running a virtualized CDH 4.2 cluster. I now want to migrate all my data to another (this time physical) set of slaves and then stop using the virtualized slaves.

I added the new physical slaves to the cluster, and marked all the old virtualized slaves as decommissioned using the dfs.hosts.exclude setting in hdfs-site.xml.
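In case it matters, my decommissioning setup looks roughly like this (the exclude file path below is just a placeholder for wherever yours lives):

    <!-- hdfs-site.xml -->
    <property>
      <name>dfs.hosts.exclude</name>
      <value>/etc/hadoop/conf/dfs.exclude</value>
    </property>

    # dfs.exclude lists one old slave hostname per line; after editing it:
    hdfs dfsadmin -refreshNodes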
Almost all of the data replicated successfully to the new slaves, but when I bring down the old slaves, some blocks start showing up as missing or corrupt (according to the NN UI as well as fsck*). If I restart the old slaves, then there are no missing blocks reported by fsck.

I've tried shutting down the old slaves two by two, and for some of them I saw no problem, but then at some point I found two slaves which, when shut down, resulted in a couple of blocks being under-replicated (1 out of 3 replicas found). For example, fsck would report something like this:

    /user/hive/warehouse/ads_destinations_hosts/part-m-00012:  Under replicated BP-1207449144-10.10.10.21-1356639087818:blk_6150201737015349469_121244. Target Replicas is 3 but found 1 replica(s).

The system then stayed in that state, apparently forever. It never actually fixed the fact that some blocks were under-replicated. Does that mean there's something wrong with some of the old datanodes...? Why do they keep blocks for themselves (even though they're decommissioned) instead of replicating those blocks to the new (non-decommissioned) datanodes?

How do I force replication of under-replicated blocks?
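For what it's worth, the only workaround I can think of is inspecting the affected file with fsck and then manually bumping its replication factor to nudge the NN into re-replicating, roughly like this (I haven't confirmed this actually clears the problem):

    # show where each block's replicas actually live
    hdfs fsck /user/hive/warehouse/ads_destinations_hosts -files -blocks -locations

    # temporarily raise replication to trigger copies onto the new nodes,
    # then set it back to the normal factor
    hadoop fs -setrep -w 4 /user/hive/warehouse/ads_destinations_hosts/part-m-00012
    hadoop fs -setrep 3 /user/hive/warehouse/ads_destinations_hosts/part-m-00012

But I'd rather understand why decommissioning isn't taking care of this by itself.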
*Actually, the NN UI and fsck report slightly different things. The NN UI always seems to report 60 under-replicated blocks, whereas fsck only reports those 60 under-replicated blocks when I shut down some of the old datanodes... When the old nodes are up, fsck reports 0 under-replicated blocks... This is very confusing!

Any help would be appreciated! Please don't hesitate to ask if I should provide some of my logs, settings, or the output of some commands...!

Thanks :)!

--
Felix