Subject: Re: decommissioning disks on a data node
From: Colin Kincaid Williams <discord@uw.edu>
To: user@hadoop.apache.org
Date: Thu, 16 Oct 2014 21:41:24 -0700

Hi Travis,

Thanks for your input. I forgot to mention that the drives are most likely in the single-drive configuration you describe.

I think what I've found is that restarting the datanodes in the manner I describe shows that the mount points on the drives with the reset blocks and newly formatted partition have gone bad. I'm then not sure the namenode will use these locations, even if it does not show the volumes as failed. Without a way to reinitialize the disks, specifically the mount points, I assume my efforts are in vain.

Therefore the only procedure that makes sense is to decommission the nodes on which I want to bring the failed volumes back up. It just didn't make sense to me that, with a large number of disks holding good data, we would end up wiping that data and starting over again.

On Oct 16, 2014 8:36 PM, "Travis" <hcoyote@ghostar.org> wrote:

> On Thu, Oct 16, 2014 at 10:01 PM, Colin Kincaid Williams <discord@uw.edu> wrote:
>
>> For some reason he seems intent on resetting the bad virtual blocks, and
>> giving the drives another shot. From what he told me, nothing is under
>> warranty anymore. My first suggestion was to get rid of the disks.
>>
>> Here's the command:
>>
>> /opt/dell/srvadmin/bin/omconfig storage vdisk action=clearvdbadblocks
>> controller=1 vdisk=$vid
>
> Well, the usefulness of this action is going to depend entirely on how
> you've actually set up the virtual disks.
>
> If you've set it up so there's only one physical disk in each vdisk
> (single-disk RAID0), then the bad "virtual" block is likely going to map to
> a real bad block.
>
> If you're doing something where there are multiple disks associated with
> each virtual disk (e.g., RAID1, RAID10 ... can't remember if RAID5/RAID6 can
> exhibit what follows), it's possible for the virtual device to have a bad
> block that is actually mapped to a good physical block underneath. This
> can happen, for example, if you had a failing drive in the vdisk and
> replaced it, but the controller had remapped the bad virtual block to some
> place good. Replacing the drive with a good one makes the controller think
> the bad block is still there. Dell calls it a punctured stripe (for a better
> description see
> http://lists.us.dell.com/pipermail/linux-poweredge/2010-December/043832.html).
> In this case, the fix is clearing the virtual bad-block list with the above
> command.
>
>> I'm still curious about how Hadoop blocks work. I'm assuming that each
>> block is stored on one of the many mount points, and not divided between
>> them. I know there is a tolerated-volume-failure option in hdfs-site.xml.
>
> Correct. Each HDFS block is actually treated as a file that lives on a
> regular filesystem, like ext3 or ext4. If you did an ls inside one of
> your vdisks, you'd see the raw blocks that the datanode is actually
> storing. You just wouldn't be able to easily tell what file a block was
> part of, because it's named with a block id, not the actual file name.
>
>> Then the question is whether the operations I laid out are legitimate,
>> specifically removing the drive in question and restarting the data node.
>> The advantage being less re-replication and less downtime.
>
> Yup. It will minimize the actual prolonged outage of the datanode
> itself. You'll get a little re-replication while the datanode process is
> off, but if you keep that time reasonably short, you should be fine. When
> the datanode process comes back up, it will walk all of its configured
> filesystems to determine which blocks it still has on disk and report that
> back to the namenode. Once that happens, re-replication will stop, because
> the namenode knows where those missing blocks are and will no longer treat
> them as under-replicated.
>
> Note: You'll still get some re-replication occurring for the blocks that
> lived on the drive you removed. But it's only a drive's worth of blocks,
> not a whole datanode.
>
> Travis
> --
> Travis Campbell
> travis@ghostar.org
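
[Archive note] Travis's point that each HDFS block is just an ordinary file on the datanode's local filesystem can be seen with a quick listing. The sketch below is illustrative, not taken from this thread: the `/data/1/dfs/dn` path is an assumption (it depends on `dfs.datanode.data.dir` in hdfs-site.xml, and real layouts also include a block-pool subdirectory), and the second half fabricates a minimal mock layout under /tmp so the listing can be tried without a cluster. The tolerated-volume-failure option Colin mentions is `dfs.datanode.failed.volumes.tolerated` in hdfs-site.xml.

```shell
# On a real datanode, blocks live under the directories listed in
# dfs.datanode.data.dir (example path is an assumption; adjust to your setup):
#   find /data/1/dfs/dn -name 'blk_*' | head

# Mock layout so this runs anywhere: a datanode stores each block as a
# blk_<id> file plus a blk_<id>_<genstamp>.meta checksum file.
mkdir -p /tmp/dn-demo/current/finalized
touch /tmp/dn-demo/current/finalized/blk_1073741825
touch /tmp/dn-demo/current/finalized/blk_1073741825_1001.meta

# The listing shows block ids only -- nothing ties a block back to the HDFS
# file name it belongs to; the namenode holds that mapping.
find /tmp/dn-demo -name 'blk_*' -type f
```

Losing one such directory therefore loses only the blocks it held, which is why pulling a single drive triggers re-replication of a drive's worth of blocks rather than a whole datanode's.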