hadoop-user mailing list archives

From Benjamin Kim <bbuil...@gmail.com>
Subject Re: Decommissioning Nodes in Production Cluster.
Date Tue, 12 Feb 2013 16:46:06 GMT

I would like to add another scenario: what are the steps for removing a 
dead node when the server has had an unrecoverable hard failure?


On Tuesday, February 12, 2013 7:30:57 AM UTC-8, sudhakara st wrote:
> The decommissioning process is controlled by an exclude file, which for 
> HDFS is set by the *dfs.hosts.exclude* property, and for MapReduce by the 
> *mapred.hosts.exclude* property. In most cases there is one shared file, 
> referred to as the exclude file. The name of this exclude file must be set 
> in the *dfs.hosts.exclude* configuration parameter when the namenode starts up.
> To remove nodes from the cluster:
> 1. Add the network addresses of the nodes to be decommissioned to the 
>    exclude file.
> 2. Restart the MapReduce cluster to stop the tasktrackers on the nodes 
>    being decommissioned.
> 3. Update the namenode with the new set of permitted datanodes, with this 
>    command:
>    % hadoop dfsadmin -refreshNodes
> 4. Go to the web UI and check whether the admin state has changed to 
>    “Decommission In Progress” for the datanodes being decommissioned. They 
>    will start copying their blocks to other datanodes in the cluster.
> 5. When all the datanodes report their state as “Decommissioned,” then all 
>    the blocks have been replicated. Shut down the decommissioned nodes.
> 6. Remove the nodes from the include file, and run:
>    % hadoop dfsadmin -refreshNodes
> 7. Remove the nodes from the slaves file.
> Decommissioning datanodes a small percentage (less than 2%) at a time does 
> not noticeably affect the cluster. It is still better to pause MR jobs 
> before triggering the decommission, to ensure that no tasks are running on 
> the nodes being decommissioned.
> If only a very small percentage of tasks is running on a decommissioning 
> node, they can be resubmitted to other tasktrackers, but if the percentage 
> of queued jobs is larger than the threshold, there is a chance of job 
> failure. Once you have run the 'hadoop dfsadmin -refreshNodes' command and 
> the decommission has started, you can resume the MR jobs.
> *Source: Hadoop: The Definitive Guide [Tom White]*
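
Steps 1 and 3 of the quoted procedure can be sketched as a small shell helper. This is only a sketch under stated assumptions: the exclude file path, the function names, and the example hostname are hypothetical and not from the thread; the only command taken from the message is `hadoop dfsadmin -refreshNodes`. Point EXCLUDE_FILE at whatever file your dfs.hosts.exclude property names.

```shell
#!/bin/sh
# ASSUMPTION: this path is a placeholder; use the file that your
# dfs.hosts.exclude property actually points to.
EXCLUDE_FILE="${EXCLUDE_FILE:-/etc/hadoop/conf/dfs.exclude}"

# Overridable so the helper can be dry-run, e.g. HADOOP="echo hadoop".
HADOOP="${HADOOP:-hadoop}"

# Step 1: append one host to the exclude file unless it is already listed.
add_to_exclude() {
    touch "$EXCLUDE_FILE"
    # -x matches the whole line, -F treats the hostname as a fixed string.
    grep -qxF "$1" "$EXCLUDE_FILE" || printf '%s\n' "$1" >> "$EXCLUDE_FILE"
}

# Step 3: ask the namenode to re-read its include/exclude files.
refresh_nodes() {
    $HADOOP dfsadmin -refreshNodes
}

# Hypothetical wrapper: exclude every host given, then refresh once.
decommission() {
    for node in "$@"; do
        add_to_exclude "$node"
    done
    refresh_nodes
}
```

For example, after sourcing the file on the namenode host, `decommission datanode07.example.com` (a made-up hostname) would add the host and trigger the refresh. Setting HADOOP="echo hadoop" first prints the command instead of running it. Steps 2 and 4 to 7 (restarting MapReduce, watching the web UI, shutting down the node, cleaning up the include and slaves files) are deliberately left manual.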
> On Tuesday, February 12, 2013 5:20:07 PM UTC+5:30, Dhanasekaran Anbalagan 
> wrote:
>> Hi Guys,
>> It has been recommended that we remove one of the datanodes in our 
>> production cluster by decommissioning that particular datanode. Please 
>> guide me.
>> -Dhanasekaran,
>> Did I learn something today? If not, I wasted it.