Subject: Re: decommissioning disks on a data node
From: Colin Kincaid Williams <discord@uw.edu>
To: user@hadoop.apache.org
Date: Thu, 16 Oct 2014 21:41:24 -0700

Hi Travis,

Thanks for your input. I forgot to mention that the drives are most likely in the single-drive configuration you describe.

I think what I've found is that restarting the datanodes in the manner I describe shows that the mount points on the drives with the reset blocks and newly formatted partition have gone bad. I'm then not sure the namenode will use these locations, even if it does not show the volumes as failed. Without a way to reinitialize the disks, specifically the mount points, I assume my efforts are in vain.

Therefore the only procedure that makes sense is to decommission the nodes on which I want to bring the failed volumes back up. It just didn't make sense to me that, with a large number of disks holding good data, we would end up wiping that data and starting over again.

On Oct 16, 2014 8:36 PM, "Travis" <hcoyote@ghostar.org> wrote:

> On Thu, Oct 16, 2014 at 10:01 PM, Colin Kincaid Williams <discord@uw.edu> wrote:
>
>> For some reason he seems intent on resetting the bad virtual blocks, and
>> giving the drives another shot. From what he told me, nothing is under
>> warranty anymore. My first suggestion was to get rid of the disks.
>>
>> Here's the command:
>>
>> /opt/dell/srvadmin/bin/omconfig storage vdisk action=clearvdbadblocks
>> controller=1 vdisk=$vid
>
> Well, the usefulness of this action is going to depend entirely on how
> you've actually set up the virtual disks.
>
> If you've set it up so there's only one physical disk in each vdisk
> (single-disk RAID0), then the bad "virtual" block is likely going to map to
> a real bad block.
>
> If you're doing something where there are multiple disks associated with
> each virtual disk (e.g., RAID1, RAID10 ... can't remember if RAID5/RAID6 can
> exhibit what follows), it's possible for the virtual device to have a bad
> block that is actually mapped to a good physical block underneath. This
> can happen, for example, if you had a failing drive in the vdisk and
> replaced it, but the controller had remapped the bad virtual block to some
> place good. Replacing the drive with a good one makes the controller think
> the bad block is still there. Dell calls it a punctured stripe (for a better
> description see
> http://lists.us.dell.com/pipermail/linux-poweredge/2010-December/043832.html).
> In this case, the fix is clearing the virtual bad-block list with the above
> command.
>
>> I'm still curious about how Hadoop blocks work. I'm assuming that each
>> block is stored on one of the many mount points, and not divided between
>> them. I know there is a tolerated-volume-failure option in hdfs-site.xml.
>
> Correct. Each HDFS block is actually treated as a file that lives on a
> regular filesystem, like ext3 or ext4. If you did an ls inside one of
> your vdisks, you'd see the raw blocks that the datanode is actually
> storing. You just wouldn't be able to easily tell what file a block was
> part of, because it's named with a block id, not the actual file name.
>
>> Then the question is whether the operations I laid out are legitimate,
>> specifically removing the drive in question and restarting the data node.
>> The advantage being less re-replication and less downtime.
>
> Yup. It will minimize the actual prolonged outage of the datanode
> itself. You'll get a little re-replication while the datanode process is
> off, but if you keep that time reasonably short, you should be fine. When
> the datanode process comes back up, it will walk all of its configured
> filesystems to determine which blocks it still has on disk and report that
> back to the namenode. Once that happens, re-replication will stop, because
> the namenode knows where those missing blocks are and will no longer treat
> them as under-replicated.
>
> Note: You'll still get some re-replication occurring for the blocks that
> lived on the drive you removed. But it's only a drive's worth of blocks,
> not a whole datanode.
>
> Travis
> --
> Travis Campbell
> travis@ghostar.org
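
[Archive note] Travis's point that each HDFS block is just an ordinary file on the datanode's local filesystem can be seen with a quick listing. The sketch below is illustrative, not taken from this thread: the `/data/1/dfs/dn` path is an assumption (it depends on `dfs.datanode.data.dir` in hdfs-site.xml, and real layouts also include a block-pool subdirectory), and the second half fabricates a minimal mock layout under /tmp so the listing can be tried without a cluster. The tolerated-volume-failure option Colin mentions is `dfs.datanode.failed.volumes.tolerated` in hdfs-site.xml.

```shell
# On a real datanode, blocks live under the directories listed in
# dfs.datanode.data.dir (example path is an assumption; adjust to your setup):
#   find /data/1/dfs/dn -name 'blk_*' | head

# Mock layout so this runs anywhere: a datanode stores each block as a
# blk_<id> file plus a blk_<id>_<genstamp>.meta checksum file.
mkdir -p /tmp/dn-demo/current/finalized
touch /tmp/dn-demo/current/finalized/blk_1073741825
touch /tmp/dn-demo/current/finalized/blk_1073741825_1001.meta

# The listing shows block ids only -- nothing ties a block back to the HDFS
# file name it belongs to; the namenode holds that mapping.
find /tmp/dn-demo -name 'blk_*' -type f
```

Losing one such directory therefore loses only the blocks it held, which is why pulling a single drive triggers re-replication of a drive's worth of blocks rather than a whole datanode's.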