hadoop-general mailing list archives

From "Segel, Mike" <mse...@navteq.com>
Subject RE: Manage a cluster where not all machines are always available
Date Tue, 18 Jan 2011 01:33:30 GMT

Charles,

If I understand you correctly, you want to trim the cluster down to only the machines that
you control...

Ok... Do you care about the data that is currently on the cluster? 
(Is all of the data yours, or replaceable?)

Can you easily copy the data off the cluster on to plain old unix file system disk space?

If not, then you have to do the following on a node-by-node basis...
A) Put the node in the dfs.exclude file and remove it from the slaves file.
B) As root, run killall -9 java to stop any Java processes. (It will end your datanode and
tasktracker jobs.)
C) Wait ~10 minutes until the JobTracker and NameNode see the node as down.
D) Run hadoop fsck / to find all of the files that are now under-replicated.
E) Run the balancer so the missing blocks get re-replicated on different machines.
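As a dry-run sketch, the steps above might look like this for one node. The hostnames and the scratch conf dir are invented for the demo; only the file edits of step A actually execute here, while the cluster-side commands (B-E) are printed for reference:

```shell
#!/bin/sh
# Dry-run sketch of steps A-E. CONF stands in for $HADOOP_HOME/conf,
# and node07.example.com is a made-up hostname.
CONF=$(mktemp -d)
NODE=node07.example.com

printf 'node01\nnode07.example.com\nnode12\n' > "$CONF/slaves"
: > "$CONF/dfs.exclude"

# A) add the node to the exclude file and drop it from slaves
echo "$NODE" >> "$CONF/dfs.exclude"
grep -v "^$NODE\$" "$CONF/slaves" > "$CONF/slaves.new" \
  && mv "$CONF/slaves.new" "$CONF/slaves"

# B-E) on a live cluster you would then run:
cat <<'EOF'
ssh root@<node> 'killall -9 java'   # B) stop datanode/tasktracker
# C) wait ~10 min for the NameNode/JobTracker to mark the node dead
hadoop dfsadmin -refreshNodes       #    make the NameNode re-read excludes
hadoop fsck /                       # D) list under-replicated files
hadoop balancer                     # E) re-spread the blocks
EOF
```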

Of course, it would help if you upped the bandwidth allowed to the balancer. Normally the
balancer is meant to run in the background, so by default it's limited to something like
1 MB/sec. If you've got a 1 Gb Ethernet link, you could easily push that number up to
100 MB/sec. Then when you run the balancer, it moves!
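In the 0.20-era configs that limit is dfs.balance.bandwidthPerSec in hdfs-site.xml, set in bytes per second (the default, 1048576, is the 1 MB/sec mentioned above). A sketch of raising it to roughly 100 MB/sec, assuming that property name matches your Hadoop release:

```xml
<!-- hdfs-site.xml: per-datanode bandwidth cap for the balancer, in bytes/sec.
     Default is 1048576 (~1 MB/s); 104857600 is ~100 MB/s.
     Property name as in 0.20-era Hadoop; verify against your version. -->
<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <value>104857600</value>
</property>
```

The datanodes read this at startup, so it needs to be in place before you kick off the balancer.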

Note: When we tried decommissioning nodes, I don't know if we had changed this parameter,
but it was taking weeks to decommission a node. (Your mileage may vary.) I'm not sure if the
long wait was due to this parameter being so low, or something else.

What I listed above should work. (Even if it is a bit ugly.)

HTH

-Mike

________________________________________
From: Charles Gonçalves [charles.fg@gmail.com]
Sent: Monday, January 17, 2011 7:07 PM
To: general@hadoop.apache.org
Subject: Manage a cluster where not all machines are always available

Hi Guys,

I'm running a series of Pig scripts on a cluster with a dozen machines.
The problem is that those machines belong to a lab at my university, and
sometimes not all of them are available for my use.
What is the best approach to managing the configuration and the data on HDFS
in this environment?

Can I simply remove the busy servers from the slaves file, start HDFS
and MapReduce, and, if needed, perform a:
hadoop balancer
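Spelled out as a script, that approach might look like the sketch below. The hostnames and scratch conf dir are invented for the demo; only the slaves-file edit runs here, and the restart commands shown are the stock 0.20-era start scripts:

```shell
#!/bin/sh
# Trim conf/slaves to the machines currently available, then restart.
# CONF stands in for $HADOOP_HOME/conf; lab01..lab04 are made-up hosts.
CONF=$(mktemp -d)
printf 'lab01\nlab02\nlab03\nlab04\n' > "$CONF/slaves"

AVAILABLE='lab01 lab03'                  # machines you control right now
printf '%s\n' $AVAILABLE > "$CONF/slaves"

# Then, on the master (not run here):
#   start-dfs.sh && start-mapred.sh      # bring HDFS and MapReduce back up
#   hadoop balancer                      # even the blocks out, if needed
cat "$CONF/slaves"                       # prints lab01, then lab03
```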

Can you see a problem with this approach?
Can anyone see another way?




--
*Charles Ferreira Gonçalves *
http://homepages.dcc.ufmg.br/~charles/
UFMG - ICEx - Dcc
Cel.: 55 31 87741485
Tel.:  55 31 34741485
Lab.: 55 31 34095840

