hadoop-mapreduce-user mailing list archives

From Travis <hcoy...@ghostar.org>
Subject Re: decommissioning disks on a data node
Date Fri, 17 Oct 2014 05:49:43 GMT
On Thu, Oct 16, 2014 at 11:41 PM, Colin Kincaid Williams <discord@uw.edu> wrote:

>  Hi Travis,
> Thanks for your input. I forgot to mention that the drives are most likely
> in the single drive configuration that you describe.

Then clearing the virtual badblock list is unlikely to do anything useful
if the drive itself already has failed sectors.  One thing you can do is
run

/opt/srvadmin/bin/omreport storage pdisk controller=$controller_number

and check whether any of the disks show a Non-Critical state (as opposed to
Ok) or a Failed state.  If a disk reports Non-Critical, OMSA is treating it
as likely to fail in the near future.  There's also a script called
check_openmanage, written to work as a Nagios check.  It has a nice way of
consolidating all of the information that OMSA reports into a useful and
concise report of what's actually happening with your Dell

It's available here:


These two things will at least give you some indication of whether you'll
be replacing the drive in the near future.
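
If you'd rather not eyeball the omreport output, something like the sketch
below can pick out the suspect disks.  It assumes the output is a series of
"Key : Value" lines in which an "ID" line starts each disk's record, and that
"Status" and "Failure Predicted" fields are present; check that against what
your OMSA version actually prints.  The function name is made up for this
example.

```shell
# Reads `omreport storage pdisk` output on stdin and prints one line per disk
# whose Status is not "Ok" or whose "Failure Predicted" field is "Yes".
# ASSUMPTION: "Key : Value" lines, with an "ID" line opening each disk record.
flag_suspect_disks() {
    awk '
        # Print the current disk only if something was flagged for it.
        function emit() { if (id != "" && why != "") print id why }
        {
            # Split on the first colon only, since IDs look like 0:0:1.
            pos = index($0, ":")
            if (pos == 0) next
            key = substr($0, 1, pos - 1)
            val = substr($0, pos + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            gsub(/^[ \t]+|[ \t]+$/, "", val)
        }
        key == "ID"                                { emit(); id = val; why = "" }
        key == "Status" && val != "Ok"             { why = why " status=" val }
        key == "Failure Predicted" && val == "Yes" { why = why " failure-predicted" }
        END { emit() }
    '
}

# Usage (controller number 0 is only an example):
#   /opt/srvadmin/bin/omreport storage pdisk controller=0 | flag_suspect_disks
```

No output means nothing was flagged, which is not the same as a guarantee
that the disks are healthy.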

> I think what I've found is that restarting the datanodes in the manner I
> describe shows that the mount points on the drives with the reset blocks
> and newly formatted partition have gone bad. Then I'm not sure the namenode
> will use these locations, even if it does not show the volumes failed.
> Without a way to reinitialize the disks, specifically the mount points, I
> assume my efforts are in vain.

So, there are really three ways that bad blocks can get dealt with:

1.  physical drive identifies a bad block via SMART and remaps it to a
reserved block.  There's a limited supply of these.  This happens
automatically within the drive's own firmware controller.
2.  virtual drive identifies a bad block via the raid controller and remaps
it somewhere, probably to a section of blocks reserved by the controller
for this.  Similar to #1.  In case of multi-disk raids, the bad block
identified that lives on disk A could potentially be remapped to a good
block on disk B because of the virtual disk block remapping.
3.  mkfs identifies bad blocks during filesystem creation and prevents data
from being written there.

I'm not sure what the actual recovery behavior is for #2 in the case of
single-disk raid0.

If #1 occurs, the drive should just be replaced.  If you absolutely can't
replace it, you can try doing #3 (assuming you use ext3/ext4; not sure
about how to do it with other filesystems), but don't be surprised if it
doesn't work or if you begin having mysterious data corruption as more and
more sectors fail on the disk.
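
For #3 on ext3/ext4, the bad-block scan is the -c option to mke2fs (mkfs.ext4
is a front end for it).  A minimal sketch, assuming the partition name below
(a placeholder, not a real device) and a wrapper function invented for this
example; it only prints the command unless you explicitly turn the dry run
off, because formatting destroys whatever is on the partition.

```shell
# Formats a partition as ext4 with a bad-block scan first.
# -c makes mke2fs run a read-only bad-block test before creating the
# filesystem; giving -c twice runs the slower read-write test instead.
# DRY_RUN defaults to 1 so nothing gets formatted by accident.
format_with_badblock_scan() {
    dev=$1
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "mkfs.ext4 -c $dev"
    else
        mkfs.ext4 -c "$dev"
    fi
}

# Usage (/dev/sdX1 is a placeholder):
#   format_with_badblock_scan /dev/sdX1             # prints the command
#   DRY_RUN=0 format_with_badblock_scan /dev/sdX1   # actually formats
```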

> Therefore the only procedure that makes sense is to decommission the nodes
> with which I want to bring the failed volumes back up. It just didn't make
> sense to me that if we have a large number of disks with good data, that we
> would end up wiping that data and starting over again.

You don't need to decom the whole thing just to replace a single disk.

There is one case where doing a full decom is useful, and that has to do
with the "risk" that the replacement drive becomes a hotspot, depending on
how you've configured the block placement policy that the datanode uses to
decide how to fill drives up.  If the datanode places all new blocks on the
new drive until its usage equalizes with the other drives in the system,
you could run into performance issues.  In practice, it probably doesn't
matter much.  "Fixing" the problem means doing a full decommission, then
adding the node back in and running a full cluster rebalance with the
Balancer.
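
For reference, the full-decommission route looks roughly like the sketch
below.  It assumes a Hadoop 2.x style `hdfs` command (older releases use
`hadoop dfsadmin` and `hadoop balancer`) and an exclude file that
dfs.hosts.exclude already points at on the namenode.  The hostname, the file
path, and the function name are made up for the example, and the function
only prints the steps rather than running them.

```shell
# Prints (does not run) the decommission / re-add / rebalance sequence.
# $1 = datanode hostname, $2 = the file dfs.hosts.exclude points at.
print_decom_plan() {
    host=$1
    exclude_file=$2
    cat <<EOF
echo $host >> $exclude_file
hdfs dfsadmin -refreshNodes
# wait until $host shows as Decommissioned, then replace the disk
# remove $host from $exclude_file
hdfs dfsadmin -refreshNodes
hdfs balancer
EOF
}

# Usage (both arguments are placeholders):
#   print_decom_plan dn01.example.com /etc/hadoop/conf/dfs.exclude
```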

There's an interesting discussion about intra-datanode block placement at

In my cluster, we almost never go this route when replacing just one disk
in a system.  We have 12 disks in each node, so replacing 1 means only about
8% of the data *on that node* could potentially run into this.  And with the
size of our cluster, that's somewhere below 0.1% of the data that could be
affected.  It's just not worth worrying about.
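
The back-of-the-envelope math, as a sketch: it assumes data is spread evenly
across the disks in a node and across the nodes in the cluster.  The 12 disks
per node is our number; the node count of 100 is a made-up example, not the
actual size of our cluster.

```shell
# Prints the share of data sitting on one disk, as a percentage of the node
# and of the whole cluster, assuming an even spread across disks and nodes.
# $1 = disks per node, $2 = number of nodes.
share_of_data() {
    awk -v disks="$1" -v nodes="$2" 'BEGIN {
        printf "node: %.1f%% cluster: %.3f%%\n", 100 / disks, 100 / (disks * nodes)
    }'
}

share_of_data 12 100   # 12 disks per node, hypothetical 100 nodes
```

With 12 disks per node, the cluster-wide share drops below 0.1% once you have
more than about 84 nodes.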

Anyhow, replace the disk.  You'll be a happier Hadoop user then. :-)

Travis Campbell
