Mailing-List: contact common-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: common-user@hadoop.apache.org
Message-ID: <4D34233F.4070902@apache.org>
Date: Mon, 17 Jan 2011 11:08:47 +0000
From: Steve Loughran <stevel@apache.org>
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US;
 rv:1.9.2.13) Gecko/20101208 Thunderbird/3.1.7
MIME-Version: 1.0
To: common-user@hadoop.apache.org
Subject: Re: Why Hadoop is slow in Cloud
References: <4D33C18A.5040504@orkash.com>
In-Reply-To: <4D33C18A.5040504@orkash.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

On 17/01/11 04:11, Adarsh Sharma wrote:
> Dear all,
>
> Yesterday I ran a comparison test between *Hadoop on Standalone
> Servers* and *Hadoop in the Cloud*.
>
> I set up a Hadoop cluster of 4 nodes (standalone machines) in
> which one node acts as master (NameNode, JobTracker) and the remaining
> nodes act as slaves (DataNodes, TaskTrackers).
> For testing Hadoop in the *Cloud* (Eucalyptus), on the other hand, I made
> one standalone machine the *Hadoop Master* and configured the slaves
> on VMs in the cloud.
>
> I am confused by the stats obtained from the testing. What I
> concluded is that the VMs give half the performance of the
> standalone servers.

Interesting stats, and nothing that massively surprises me, especially as
your benchmarks are very much streaming through datasets. If you were
doing something more CPU-intensive (graph work, for example), things
wouldn't look so bad.

I've done some work in this area:
http://www.slideshare.net/steve_l/farming-hadoop-inthecloud


>
> I expected some slowdown, but not at this level. Is
> this genuine, or might there be some configuration problem?
>
> I am using a 1 Gb/s (10-1000 Mb/s) LAN on the VMs and 100 Mb/s on the
> standalone servers.
>
> Please have a look at the results and, if interested, comment on them.
>


The big killer here is file IO. With today's virtual HDD controllers and
virtual filesystems, disk IO inside a VM is way underpowered compared to
physical disk IO. Networking is reduced (but improving), and CPU can be
pretty good, but disk is bad.


Why?

1. Every access to a block in the VM is turned into virtual disk
controller (VDC) operations, which the VDC then interprets and turns
into reads/writes on the virtual disk drive.

2. Those are in turn translated into seeks, reads and writes on the
physical hardware.
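
If you want to see how much of the gap is down to disk, a rough check is
to time a synced sequential write on one of the VMs and on one of the
standalone boxes, then compare. A minimal Python sketch follows; the
function name and the sizes are made up for illustration, and it is no
substitute for a proper benchmark:

```python
import os
import tempfile
import time


def sequential_write_mb_per_s(directory, total_mb=64, chunk_mb=1):
    """Write about total_mb of data to a temp file in `directory`,
    fsync it, and return the observed throughput in MB/s."""
    # Random bytes, so a compressing filesystem cannot shrink the payload.
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    chunks = max(1, total_mb // chunk_mb)
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(chunks):
                f.write(chunk)
            f.flush()
            # Push the data out of the OS page cache before stopping the clock.
            os.fsync(f.fileno())
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)
    return (chunks * chunk_mb) / elapsed


if __name__ == "__main__":
    # Point this at the directory that backs your Hadoop data dirs.
    rate = sequential_write_mb_per_s(tempfile.gettempdir())
    print("%.1f MB/s" % rate)
```

Run it against the same kind of directory on both setups. Note that
fsync in a guest only guarantees what the hypervisor chooses to
guarantee, so treat the VM number as an upper bound.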

Some workarounds:

- Allocate physical disks for the HDFS filesystem, for the lifetime of
the VMs.

- Have the physical hosts serve up a bit of their filesystem over a fast
protocol (like NFS), and have every VM mount its host's NFS filestore
as its Hadoop data dirs.
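
Either way, the datanode and tasktracker then need to be pointed at the
new location. As a sketch, assuming a Hadoop release of this era where
the settings are called dfs.data.dir and mapred.local.dir (later releases
renamed the first to dfs.datanode.data.dir), and using a hypothetical
mount point /mnt/hostdisk, something like this generates the <property>
entries to paste into hdfs-site.xml and mapred-site.xml:

```python
import xml.etree.ElementTree as ET


def data_dir_property(name, directories):
    """Return a Hadoop <property> element, as a string, whose value is
    the comma-separated list of directories."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = ",".join(directories)
    return ET.tostring(prop, encoding="unicode")


if __name__ == "__main__":
    # /mnt/hostdisk is a made-up mount point for the physical disk or
    # the NFS export from the host; substitute your own.
    print(data_dir_property("dfs.data.dir", ["/mnt/hostdisk/hdfs/data"]))
    print(data_dir_property("mapred.local.dir", ["/mnt/hostdisk/mapred/local"]))
```

Both settings take a comma-separated list, so several mounts can be
listed if a VM has more than one.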