hadoop-common-user mailing list archives

From Steve Loughran <ste...@apache.org>
Subject Re: Why Hadoop is slow in Cloud
Date Fri, 21 Jan 2011 09:59:38 GMT
On 21/01/11 09:20, Evert Lammerts wrote:
>> Even with the performance hit, there are still benefits to running Hadoop
>> this way:
>>  - as you only consume/pay for CPU time you use, if you are only running
>>    batch jobs, it's lower cost than having a Hadoop cluster that is
>>    under-used.
>>  - if your data is stored in the cloud infrastructure, then you need to
>>    data mine it in VMs, unless you want to take the time and money hit of
>>    moving it out, and have somewhere to store it.
>>  - if the infrastructure lets you, you can lock down the cluster so it is
>>    secure.
>> Where a physical cluster is good is that it is a very low cost way of
>> storing data, provided you can analyse it with Hadoop, and provided you
>> can keep that cluster busy most of the time, either with Hadoop work or
>> other scheduled work. If your cluster is idle for computation, you are
>> still paying the capital and (reduced) electricity costs, so the cost of
>> storage and what compute you do effectively increases.
> Agreed, but this has little to do with Hadoop as a middleware and more to do
> with the benefits of virtualized vs physical infrastructure. I agree that it
> is convenient to use HDFS as a DFS to keep your data local to your VMs, but
> you could choose other DFS's as well.

We don't use HDFS; we bring up VMs close to where the data persists.


> The major benefit of Hadoop is its data-locality principle, and this is what
> you give up when you move to the cloud. Regardless of whether you store your
> data within your VM or on a NAS, it *will* have to travel over a line. As
> soon as that happens you lose the benefit of data-locality and are left with
> MapReduce as a way for parallel computing. And in that case you could use
> less restrictive software, like maybe PBS. You could even install HOD on
> your virtual cluster, if you'd like the possibility of MapReduce.

We don't suffer locality hits so much, but you do pay for the extra 
infrastructure costs of a more agile datacentre, and if you go to 
redundancy in hardware over replication, you have fewer places to run 
your code.

Even on EC2, which doesn't let you tell it what datasets you want to 
play with for its VM placer to use in its decisions, once data is in the 
datanodes you do get locality.

> Adarsh, there are probably results around of more generic benchmark tools
> (Phoronix, POV-Ray, ...) for I/O and CPU performance in a VM. Those should
> give you a better idea of the penalties of virtualization. (Our experience
> with a number of technologies on our OpenNebula cloud is, like Steve points
> out, that you mainly pay for disk I/O performance.)

-would be interesting to see anything you can publish there...

> I think a decision to go with either cloud or physical infrastructure should
> be based on the frequency, intensity and types of computation you expect on
> the short term (that should include operations dealing with data), and the
> way you think these parameters will develop on a mid-long term. And then
> compare the prices of a physical cluster that meets those demands (make sure
> to include power and operations) and the investment you would otherwise need
> to make in Cloud.
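
A rough sketch of that comparison, in case it helps frame the numbers. Everything here is an assumption for illustration: the function names, the per-node-hour cost model, and the idea that an idle node still draws (reduced) power. None of the figures come from this thread. It works out the effective cost of one busy node-hour on an owned cluster at a given utilization, and the utilization at which that matches a pay-per-use cloud rate.

```python
def physical_cost_per_used_hour(capital_per_hour, power_busy_per_hour,
                                power_idle_per_hour, utilization):
    """Effective cost of one *busy* node-hour on an owned cluster.

    capital_per_hour: hardware/operations cost amortised per wall-clock hour
    power_busy_per_hour / power_idle_per_hour: electricity cost when the
        node is working vs. idle (idle is assumed lower, not zero)
    utilization: fraction of wall-clock time the node does useful work (0..1]
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    # Cost of one wall-clock hour, mixing busy and idle power draw.
    wall_clock_cost = (capital_per_hour
                       + utilization * power_busy_per_hour
                       + (1 - utilization) * power_idle_per_hour)
    # Idle time is still paid for, so spread the cost over busy hours only.
    return wall_clock_cost / utilization


def break_even_utilization(capital_per_hour, power_busy_per_hour,
                           power_idle_per_hour, cloud_rate_per_hour):
    """Utilization at which owned and cloud cost per busy hour are equal.

    Returns None if the owned cluster never becomes cheaper, even when
    it is busy all of the time.
    """
    # Solve capital + u*busy + (1-u)*idle = u*cloud for u.
    denominator = cloud_rate_per_hour - power_busy_per_hour + power_idle_per_hour
    if denominator <= 0:
        return None
    u = (capital_per_hour + power_idle_per_hour) / denominator
    return u if u <= 1 else None


if __name__ == "__main__":
    # Hypothetical prices per node-hour, chosen only to exercise the code.
    capital, busy, idle, cloud = 0.10, 0.05, 0.03, 0.40
    for u in (0.25, 0.50, 1.00):
        cost = physical_cost_per_used_hour(capital, busy, idle, u)
        print(f"utilization {u:.2f}: {cost:.3f} per busy node-hour")
    print("break-even utilization:",
          break_even_utilization(capital, busy, idle, cloud))
```

Below the break-even utilization the pay-per-use option is cheaper per busy hour; above it the owned cluster wins. The model deliberately leaves out storage, data transfer and the disk I/O penalty of virtualization, which would all need their own terms.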

