hadoop-common-user mailing list archives

From Edward Capriolo <edlinuxg...@gmail.com>
Subject Re: Why Hadoop is slow in Cloud
Date Mon, 17 Jan 2011 15:41:04 GMT
On Mon, Jan 17, 2011 at 6:08 AM, Steve Loughran <stevel@apache.org> wrote:
> On 17/01/11 04:11, Adarsh Sharma wrote:
>> Dear all,
>> Yesterday I ran a comparison test between *Hadoop on Standalone
>> Servers* and *Hadoop in the Cloud*.
>> I set up a Hadoop cluster of 4 nodes (standalone machines) in
>> which one node acts as Master (Namenode, Jobtracker) and the remaining
>> nodes act as slaves (Datanodes, Tasktrackers).
>> On the other hand, for testing Hadoop in the *Cloud* (Eucalyptus), I made
>> one standalone machine the *Hadoop Master* and configured the slaves
>> on the VMs in the cloud.
>> I am confused by the stats obtained from the testing. What I
>> concluded is that the VMs give half the performance of the
>> standalone servers.
> Interesting stats, nothing that massively surprises me, especially as your
> benchmarks are very much streaming through datasets. If you were doing
> something more CPU intensive (graph work, for example), things wouldn't look
> so bad.
> I've done stuff in this area.
> http://www.slideshare.net/steve_l/farming-hadoop-inthecloud
>> I expected some slowdown, but never at this level. Is
>> this genuine, or might there be some configuration problem?
>> I am using a 1 Gb (10-1000 Mb/s) LAN on the VM machines and 100 Mb/s on
>> the standalone servers.
>> Please have a look at the results and, if interested, comment on them.
> The big killer here is file IO. With today's HDD controllers and virtual
> filesystems, disk IO in a VM is way underpowered compared to physical disk
> IO. Networking is reduced (but improving), and CPU can be pretty good, but
> disk is bad.
> Why?
> 1.  Every access to a block in the VM is turned into virtual disk controller
> operations which are then interpreted by the VDC and turned into
> reads/writes in the virtual disk drive
> 2. which is turned into seeks, reads and writes in the physical hardware.
> Some workarounds
> -allocate physical disks for the HDFS filesystem, for the duration of the
> VMs.
> -have the local hosts serve up a bit of their filesystem on a fast protocol
> (like NFS), and have every VM mount the local physical NFS filestore as
> their hadoop data dirs.
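
One way to check how much of the gap is disk IO is to time a sequential write and read in the Hadoop data directory, once on a physical host and once inside a VM. The following is only a rough sketch, not a proper benchmark: the function name, file size and block size are my own illustrative choices, and the read figure will usually be inflated by the OS page cache because the file has just been written.

```python
import os
import tempfile
import time


def write_then_read_mb_per_s(directory, size_mb=64, block_kb=1024):
    """Return (write MB/s, read MB/s) for a sequential temp file in `directory`.

    Hypothetical helper for illustration; sizes are arbitrary defaults.
    """
    block_bytes = block_kb * 1024
    block = os.urandom(block_bytes)
    blocks = (size_mb * 1024) // block_kb
    fd, path = tempfile.mkstemp(dir=directory)
    total = 0
    try:
        # Sequential write, then fsync so the data really reaches the
        # (possibly virtual) disk before the clock stops.
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())
        write_s = time.perf_counter() - start

        # Sequential read of the same file. Likely served partly from the
        # page cache, so treat this number as an upper bound.
        start = time.perf_counter()
        with open(path, "rb") as f:
            while True:
                chunk = f.read(block_bytes)
                if not chunk:
                    break
                total += len(chunk)
        read_s = time.perf_counter() - start
    finally:
        os.remove(path)

    mb = total / (1024 * 1024)
    # Guard against a zero elapsed time on very small files.
    return mb / max(write_s, 1e-9), mb / max(read_s, 1e-9)
```

Calling `write_then_read_mb_per_s("/path/to/data/dir")` on both a standalone server and a VM, with the same sizes, gives a quick side-by-side number for the write path in particular. A file much larger than RAM would be needed for the read figure to mean much.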

Q: "Why is my Nintendo emulator slow on an 800 MHz computer made 10
years after Nintendo?"
A: Emulation

For everything you emulate, you cut X% of performance right off the top.

Emulation is great when you want to run Mac on Windows, FreeBSD on
Linux, or Nintendo on Linux. However, most people would do better with
technologies that use kernel-level isolation, such as Linux containers,
Solaris Zones, Linux VServer (my favorite) http://linux-vserver.org/,
User Mode Linux or similar technologies that ISOLATE rather than

Sorry, list, I feel I rant about this biannually. I have just always
been so shocked at how many people get lured into cloud and
virtualized solutions for "better management" and "near native
