Delivered-To: mailing list common-user@hadoop.apache.org
Reply-To: common-user@hadoop.apache.org
From: "Avi Vaknin"
To: common-user@hadoop.apache.org
Subject: Hadoop cluster optimization
Date: Sun, 21 Aug 2011 14:57:16 +0300

Hi all!

How are you? My name is Avi, and I have been fascinated by Apache Hadoop for the last few months. I have spent the last two weeks trying to optimize my configuration files and environment. I have gone through many of Hadoop's configuration properties, and it seems that none of them makes a big difference (about +-3 minutes of total job run time).

By Hadoop's standards my cluster is considered extremely small (260 GB of text files, while each job processes only about 8 GB). I have one server acting as NameNode and JobTracker, and another 5 servers acting as DataNodes and TaskTrackers. Right now Hadoop's configuration is set to the defaults, apart from the DFS block size, which is set to 256 MB since every file in my cluster is 155 - 250 MB.

All of the servers above are identical, with the following hardware and software:

- 1.7 GB memory
- 1 Intel(R) Xeon(R) CPU E5507 @ 2.27 GHz
- Ubuntu Server 10.10, 32-bit platform
- Cloudera CDH3, manual Hadoop installation

(For those familiar with Amazon Web Services, these are Small EC2 instances.)

Total job run time is about 15 minutes (roughly 50 files/blocks/map tasks of up to 250 MB each, and 10 reduce tasks).
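For reference, the block-size change mentioned above is my only non-default setting. In CDH3 (Hadoop 0.20) it lives in hdfs-site.xml, roughly like this (the value is in bytes; 268435456 = 256 * 1024 * 1024):

```xml
<!-- hdfs-site.xml fragment: raise the HDFS block size to 256 MB so each
     150-250 MB input file fits in a single block (one map task per file).
     The property is named dfs.block.size in Hadoop 0.20/CDH3; later
     versions renamed it to dfs.blocksize. -->
<property>
  <name>dfs.block.size</name>
  <value>268435456</value>
</property>
```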
Based on the above, can anyone recommend a best-practice configuration? Do you think that with such a small cluster, processing such a small amount of data, it is even possible to make jobs run much faster?

By the way, it seems that none of the nodes has a hardware performance problem (CPU/memory) while running the job - unless I have a bottleneck somewhere else (network bandwidth does not appear to be the issue). This is a little confusing, because the NameNode and JobTracker processes should each allocate 1 GB of memory, which would mean my hardware is insufficient to begin with; in that case, why don't I see full memory utilization in 'top' on the NameNode & JobTracker server?

How would you recommend measuring/monitoring Hadoop's various properties to find out where the bottleneck is?

Thanks for your help!!
Avi
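On the memory question, one direction I am considering (an assumption on my part, not a tested recommendation): with only 1.7 GB per node, the stock default of 1000 MB heap per Hadoop daemon is too generous for two daemons per box, and the heaps can be shrunk in hadoop-env.sh. Also, a JVM's -Xmx is only an upper bound and heap is committed lazily, which may be why 'top' does not show the full 1 GB per process. A sketch, with guessed values just to illustrate:

```shell
# hadoop-env.sh sketch for 1.7 GB nodes (the numbers are illustrative
# guesses, not tuned recommendations).

# Maximum heap, in MB, for each Hadoop daemon (NameNode, JobTracker,
# DataNode, TaskTracker). The stock default is 1000 MB.
export HADOOP_HEAPSIZE=512

# The per-task child JVMs are sized separately, in mapred-site.xml:
#   mapred.child.java.opts = -Xmx200m
# so that the map/reduce slots fit in the memory left over on each node.
```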