Subject: Re: decommissioning disks on a data node
From: Colin Kincaid Williams <discord@uw.edu>
To: user@hadoop.apache.org
Date: Thu, 16 Oct 2014 20:01:54 -0700

For some reason he seems intent on resetting the virtual disk bad blocks and giving the drives another shot. From what he told me, nothing is under warranty anymore. My first suggestion was to get rid of the disks.

Here's the command:

    /opt/dell/srvadmin/bin/omconfig storage vdisk action=clearvdbadblocks controller=1 vdisk=$vid

I'm still curious about how Hadoop blocks work. I'm assuming that each block is stored on one of the many mount points, and not divided between them. I know there is a tolerated volume failure option in hdfs-site.xml.

So the question is whether the operations I laid out are legitimate, specifically removing the drive in question from the configuration and restarting the data node. The advantage would be less re-replication and less downtime.
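For illustration, a minimal hdfs-site.xml sketch of the two settings in question (the /data/N mount points are hypothetical, and the property names assume Hadoop 2.x; older releases used dfs.data.dir):

    <!-- Hypothetical storage layout; /data/3 is the failing PERC virtual disk,
         so it has been dropped from the list before restarting the datanode. -->
    <property>
      <name>dfs.datanode.data.dir</name>
      <value>/data/1/dfs/dn,/data/2/dfs/dn,/data/4/dfs/dn</value>
    </property>

    <!-- The tolerated-volume-failure option: how many data directories may fail
         before the datanode takes itself offline. -->
    <property>
      <name>dfs.datanode.failed.volumes.tolerated</name>
      <value>1</value>
    </property>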
On Thu, Oct 16, 2014 at 6:58 PM, Travis <hcoyote@ghostar.org> wrote:
>
> On Thu, Oct 16, 2014 at 7:03 PM, Colin Kincaid Williams <discord@uw.edu> wrote:
>
>> We have been seeing some of the disks on our cluster having bad blocks,
>> and then failing. We are using some Dell PERC H700 disk controllers that
>> create "virtual devices".
>
> Are you doing a bunch of single-disk RAID0 devices with the PERC to mimic
> JBOD?
>
>> Our hosting manager uses a Dell utility which reports "virtual device bad
>> blocks". He has suggested that we use the Dell tool to remove the "virtual
>> device bad blocks", and then re-format the device.
>
> Which Dell tool is he using for this? The OMSA tools? In practice, if
> OMSA is telling you the drive is bad, it has likely already exhausted all
> the reserved blocks available for remapping bad sectors, and it is probably
> not worth messing with the drive. Just get Dell to replace it (assuming
> your hardware is under warranty or support).
>
>> I'm wondering if we can remove the disks in question from hdfs-site.xml
>> and restart the datanode, so that we don't re-replicate the Hadoop blocks
>> on the other disks. Then we would go ahead and work on the troubled disk
>> while the datanode remained up. Finally, we would restart the datanode
>> again after re-adding the freshly formatted (possibly new) disk. This way
>> the data on the remaining disks doesn't get re-replicated.
>>
>> I don't know too much about the Hadoop block system. Will this work? Is
>> it an acceptable strategy for disk maintenance?
>
> The data may still re-replicate from the missing disk within your cluster
> if the namenode determines that those blocks are under-replicated.
>
> Unless your cluster is so tight on space that you couldn't handle taking
> one disk out for maintenance, the re-replication of blocks from the missing
> disk within the cluster should be fine. You don't need to keep the entire
> datanode down throughout the entire time you're running tests on the drive.
> The process you laid out is basically how we manage disk maintenance on our
> Dells: stopping the datanode, unmounting the broken drive, modifying the
> hdfs-site.xml for that node, and restarting it.
>
> I've automated some of this process with Puppet by taking advantage of
> ext3/ext4's ability to set a label on the partition that Puppet looks for
> when configuring mapred-site.xml and hdfs-site.xml. I talk about it in a
> few blog posts from a few years back if you're interested.
>
> http://www.ghostar.org/2011/03/hadoop-facter-and-the-puppet-marionette/
> http://www.ghostar.org/2013/05/using-cobbler-with-a-fast-file-system-creation-snippet-for-kickstart-post-install/
>
> Cheers,
> Travis
> --
> Travis Campbell
> travis@ghostar.org
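A rough shell sketch of the per-disk maintenance flow described above (the /data/3 mount point, the config path, and the service commands are assumptions; they vary by distribution and install method):

    # 1. Stop the datanode on the affected host.
    sudo service hadoop-hdfs-datanode stop

    # 2. Unmount the broken drive.
    sudo umount /data/3

    # 3. Remove /data/3 from dfs.datanode.data.dir in hdfs-site.xml
    #    (by hand or via configuration management).
    sudo vi /etc/hadoop/conf/hdfs-site.xml

    # 4. Restart the datanode; it now serves blocks from the remaining disks
    #    while the bad drive is tested, reformatted, or replaced.
    sudo service hadoop-hdfs-datanode start

Once the drive is reformatted or replaced and remounted, reversing step 3 and restarting the datanode again brings it back into service.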