hadoop-common-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject RE: decommissioning node woes
Date Fri, 18 Mar 2011 16:59:19 GMT

Uhmm...

If you use the default bandwidth allocation and you have a lot of data on the node you want
to decommission, you can be waiting for weeks before you can safely take the node out.
If you wanted to, you could take the nodes down one by one, running an fsck between
removals to confirm that the under-replicated blocks have been identified and re-replicated.
("Normally Namenode automatically corrects most of the recoverable failures.")

Once you see those blocks successfully replicated... you can take down the next.
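
A rough sketch of that check, for anyone who wants to script it. It assumes the
summary printed by `hadoop fsck /` contains a line of the form
"Under-replicated blocks: <count> (<percent> %)"; verify that against your own version's
output first. The function names and the sample text are made up for illustration.

```python
import re


def under_replicated_blocks(fsck_output: str) -> int:
    """Return the count from the 'Under-replicated blocks' summary line.

    Assumes the line looks like 'Under-replicated blocks: 12 (0.5 %)'.
    """
    match = re.search(r"Under-replicated blocks:\s*(\d+)", fsck_output)
    if match is None:
        raise ValueError("no 'Under-replicated blocks' line found in fsck output")
    return int(match.group(1))


def safe_to_remove_next_node(fsck_output: str) -> bool:
    """True once fsck reports no under-replicated blocks."""
    return under_replicated_blocks(fsck_output) == 0


# Illustrative input only, not a captured log.
sample = " Under-replicated blocks:\t12 (0.5 %)\n"
print(under_replicated_blocks(sample))
print(safe_to_remove_next_node(sample))
```

Feed it the captured text of an fsck run after each node comes out, and only move on to
the next node when safe_to_remove_next_node() returns True.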

Is it clean? No, not really.
Is it dangerous? No, not really.
Do I recommend it? No, but it's a quick and dirty way of doing things...

Or you can up your dfs.balance.bandwidthPerSec in the configuration files. The default is
pretty low.

The downside is that you have to bounce the cloud to get this value updated, and it could
have a negative impact on performance if set too high.
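
To put a number on "weeks", here is a back-of-the-envelope estimate. It assumes the
shipped default for dfs.balance.bandwidthPerSec is 1048576 bytes/sec (1 MB/s; check
hdfs-default.xml for your release), that a node's data drains at no more than that rate,
and that the node holds about 2TB as in the original post. Real re-replication is spread
across many nodes, so treat this as an order-of-magnitude figure only. drain_days is a
made-up helper name.

```python
SECONDS_PER_DAY = 86400


def drain_days(bytes_on_node: float, bandwidth_bytes_per_sec: float) -> float:
    """Days needed to move bytes_on_node at a fixed per-node rate."""
    return bytes_on_node / bandwidth_bytes_per_sec / SECONDS_PER_DAY


ASSUMED_DEFAULT = 1024 * 1024   # 1 MB/s, assumed default; verify for your release
NODE_BYTES = 2 * 10**12         # roughly 2TB per node, from the original post

print(round(drain_days(NODE_BYTES, ASSUMED_DEFAULT), 1))       # 22.1
print(round(drain_days(NODE_BYTES, 10 * ASSUMED_DEFAULT), 1))  # 2.2
```

So roughly three weeks at the assumed default, and a couple of days at ten times that,
which is why the value is worth raising if the cluster can take the extra load.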

HTH

-Mike



> From: tdunning@maprtech.com
> Date: Fri, 18 Mar 2011 09:38:31 -0700
> Subject: Re: decommissioning node woes
> To: common-user@hadoop.apache.org
> CC: james@tynt.com
> 
> Unless the last copy is on that node.
> 
> Decommissioning is the only safe way to shut off 10 nodes at once.  Doing
> them one at a time and waiting for replication to (asymptotically) recover
> is painful and error prone.
> 
> On Fri, Mar 18, 2011 at 9:08 AM, James Seigel <james@tynt.com> wrote:
> 
> > Just a note.  If you just shut the node off, the blocks will replicate
> > faster.
> >
> > James.
> >
> >
> > On 2011-03-18, at 10:03 AM, Ted Dunning wrote:
> >
> > > > If nobody else more qualified is willing to jump in, I can at least provide
> > > > some pointers.
> > > >
> > > > What you describe is a bit surprising.  I have zero experience with any 0.21
> > > > version, but decommissioning was working well in much older versions, so this
> > > > would be a surprising regression.
> > > >
> > > > The observations you have aren't all inconsistent with how decommissioning
> > > > should work.  The fact that your nodes look up after starting the
> > > > decommissioning isn't so strange.  The idea is that no new data will be put on
> > > > the node, nor should it be counted as a replica, but it will help in reading data.
> > > >
> > > > So that isn't such a big worry.
> > > >
> > > > The fact that it takes forever and a day, however, is a big worry.  I cannot
> > > > provide any help there just off hand.
> > > >
> > > > What happens when a datanode goes down?  Do you see under-replicated files?
> > > > Does the number of such files decrease over time?
> > >
> > > On Fri, Mar 18, 2011 at 4:23 AM, Rita <rmorgan466@gmail.com> wrote:
> > >
> > >> Any help?
> > >>
> > >>
> > >> On Wed, Mar 16, 2011 at 9:36 PM, Rita <rmorgan466@gmail.com> wrote:
> > >>
> > >>> Hello,
> > >>>
> > > >>> I have been struggling with decommissioning data nodes. I have a 50+ data
> > > >>> node cluster (no MR) with each server holding about 2TB of storage. I split
> > > >>> the nodes into 2 racks.
> > >>>
> > >>>
> > > >>> I edit the 'exclude' file and then do a -refreshNodes. I see the node
> > > >>> immediately in 'Decommissioned node' and I also see it as a 'live' node!
> > > >>> Even though I wait 24+ hours it's still like this. I am suspecting it's a bug
> > > >>> in my version.  The data node process is still running on the node I am
> > > >>> trying to decommission. So, sometimes I kill -9 the process and I see the
> > > >>> 'under replicated' blocks...this can't be the normal procedure.
> > >>>
> > > >>> There were even times that I had corrupt blocks because I was impatient --
> > > >>> waited 24-34 hours
> > >>>
> > > >>> I am using 23 August, 2010: release 0.21.0
> > > >>> <http://hadoop.apache.org/hdfs/releases.html#23+August%2C+2010%3A+release+0.21.0+available>
> > > >>> version.
> > >>>
> > > >>> Is this a known bug? Is there anything else I need to do to decommission a
> > > >>> node?
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> --
> > >>> --- Get your facts first, then you can distort them as you please.--
> > >>>
> > >>
> > >>
> > >>
> > >> --
> > >> --- Get your facts first, then you can distort them as you please.--
> > >>
> >
> >
 		 	   		  