hadoop-common-user mailing list archives

From <stanley....@emc.com>
Subject RE: Hadoop cluster optimization
Date Mon, 22 Aug 2011 01:46:47 GMT
Hi Avi,

I'm also learning Hadoop now. There's a tool named "nmon" that can track a
server's resource usage; you can use it to monitor the memory, CPU, disk and
network usage of the servers. It's very easy to use, and there's an
nmon-analyzer that can generate Excel diagrams based on the nmon data.

Hope this helps

-----Original Message-----
From: Avi Vaknin [mailto:avivaknin13@gmail.com] 
Sent: Sunday, August 21, 2011 7:57
To: common-user@hadoop.apache.org
Subject: Hadoop cluster optimization

Hi all!
How are you?

My name is Avi, and I have been fascinated by Apache Hadoop for the last few
months. I have spent the last two weeks trying to optimize my configuration
files and environment.
I have been going through many of Hadoop's configuration properties, and it
seems that none of them makes a big difference (± 3 minutes of total job run time).

By Hadoop standards my cluster is considered extremely small (260 GB of
text files, while every job goes through only about 8 GB).
I have one server acting as "NameNode and JobTracker", and another five servers
acting as "DataNodes and TaskTrackers".
Right now Hadoop's configuration is set to the defaults, except for the DFS block
size, which is set to 256 MB since every file on my cluster takes 155 - 250 MB.
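
For reference, a minimal sketch of how such an override can also be set per job
in a driver, assuming the 0.20/CDH3-era property name "dfs.block.size" (the
cluster-wide default normally lives in hdfs-site.xml):

import org.apache.hadoop.conf.Configuration;

public class BlockSizeSketch {
    public static void main(String[] args) {
        // Hypothetical driver fragment: override the HDFS block size used for
        // files written through this Configuration (256 MB, so each 155-250 MB
        // file fits in a single block).
        Configuration conf = new Configuration();
        conf.setLong("dfs.block.size", 256L * 1024 * 1024);
        System.out.println("dfs.block.size = " + conf.getLong("dfs.block.size", 0));
    }
}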

All of the above servers are identical and have the following
hardware and software:
1.7 GB memory
1 Intel(R) Xeon(R) CPU E5507 @ 2.27GHz
Ubuntu Server 10.10 , 32-bit platform
Cloudera CDH3 Manual Hadoop Installation
(for the ones who are familiar with Amazon Web Services, I am talking about
Small EC2 Instances/Servers)

Total job run time is about 15 minutes (roughly 50 files/blocks/map tasks of up to 250
MB each, and 10 reduce tasks).
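
For what it's worth, a minimal driver sketch of where those counts come from
(the class and job name are hypothetical; the reduce count is set explicitly,
while the map count simply follows from the input splits):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class DriverSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "example-job"); // 0.20-era constructor; Job.getInstance() came later
        job.setNumReduceTasks(10);              // the 10 reduce tasks mentioned above
        // The ~50 map tasks are not set here: each 155-250 MB input file fits
        // in one 256 MB block, so roughly one map task per file falls out of
        // the input splits automatically.
    }
}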

Based on the above information, can anyone recommend a best practice?
Do you think that, when dealing with such a small cluster and
processing such a small amount of data,
it is even possible to optimize jobs so they would run much faster?

By the way, it seems like none of the nodes has a hardware
performance issue (CPU/memory) while running the job.
That's true unless I have a bottleneck somewhere else (network
bandwidth does not seem to be the issue).
That point is a little confusing, because the NameNode process and the
JobTracker process should each allocate 1 GB of memory,
which means that my hardware starting point is insufficient; in that case,
why am I not seeing full memory utilization using the 'top'
command on the NameNode & JobTracker server?
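
As a small, self-contained illustration of the difference between a configured
heap ceiling and what a JVM has actually committed (a hypothetical probe, not
part of the Hadoop daemons; run it with -Xmx1000m to mirror the 1 GB default
mentioned above):

public class HeapProbe {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long mb = 1024 * 1024;
        // maxMemory() reflects the -Xmx ceiling; totalMemory() is what the JVM
        // has actually committed so far, and what 'top' reports as resident
        // roughly tracks the pages that have actually been touched.
        System.out.println("max heap  : " + rt.maxMemory() / mb + " MB");
        System.out.println("committed : " + rt.totalMemory() / mb + " MB");
        System.out.println("used      : " + (rt.totalMemory() - rt.freeMemory()) / mb + " MB");
    }
}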

How would you recommend measuring/monitoring the different Hadoop properties to
find out where the bottleneck is?

Thanks for your help!!


