hadoop-common-user mailing list archives

From Adarsh Sharma <adarsh.sha...@orkash.com>
Subject Hadoop on Cloud or Not
Date Thu, 09 Dec 2010 10:29:25 GMT

I have Eucalyptus 1.6.2 installed from source on Ubuntu 10.04, using 
KVM. Currently I have ten nodes in my cloud in a single-cluster 
architecture.
I have also tested Hadoop on VMs and run several jobs.

I am trying to run Hadoop in a cloud environment, so I will launch 
Hadoop instances on the cloud. Each Hadoop node holds a large amount of 
data, so for now I am planning to use volumes to store the data of 
each instance, i.e. each Hadoop node. But volumes are stored on the 
Storage Controller (SC), which means there is continuous movement of 
data (many GBs) across the cloud network from the SC to the nodes. The 
response time of work done on the Hadoop instances will also be slow 
because of the time the data takes to travel over the network.
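
To put rough numbers on this concern, here is a minimal back-of-envelope 
sketch in Python. The figures used (100 GB per node, a 1 Gbit/s link, 80% 
usable bandwidth) and the function name are hypothetical values chosen for 
illustration, not measurements from this cluster.

```python
def transfer_seconds(data_gb, link_gbit_per_s, efficiency=0.8):
    """Estimate the time in seconds to move data_gb gigabytes over a link.

    All inputs are assumptions supplied by the caller:
      data_gb         -- data volume in gigabytes (1 GB = 8 Gbit)
      link_gbit_per_s -- nominal link speed in gigabits per second
      efficiency      -- fraction of nominal bandwidth actually usable
    """
    if data_gb < 0:
        raise ValueError("data_gb must be non-negative")
    if link_gbit_per_s <= 0:
        raise ValueError("link_gbit_per_s must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    data_gbit = data_gb * 8  # convert gigabytes to gigabits
    return data_gbit / (link_gbit_per_s * efficiency)


if __name__ == "__main__":
    # Hypothetical case: 100 GB per node read from the SC over 1 Gbit/s.
    per_node = transfer_seconds(100, 1.0)
    print(f"one node:  {per_node / 60:.1f} minutes")
    # If ten nodes share the same SC uplink, the reads are serialized
    # on that link in the worst case, so the total scales linearly.
    print(f"ten nodes: {10 * per_node / 3600:.1f} hours")
```

Under these assumed figures, a single pass over the data costs minutes per 
node and hours for the whole cluster before any computation starts, which 
is why keeping the data local to the nodes matters.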

So, is it possible to store the volumes on the nodes themselves (or is 
there any other way) so that the above problem can be resolved?

Second case: I can store the data on the hard disks attached to the 
nodes, and the Hadoop instances can then access that data easily. But 
for that I would need to start each instance on the node where its data 
is stored. So is there any hack or other way to choose the node on 
which an instance is started?

Can anyone who has working experience with Hadoop in a cloud 
environment give me any pointers?
I would really appreciate any sort of support on this.

Finally, is it worthwhile to do this at all? I previously received a 
response like this:

> Is it possible to run Hadoop in VMs on Production Clusters so that we
> have 10000s of nodes on 100s of servers to achieve high performance
> through Cloud Computing.

You don't achieve performance that way. You are better off with 1 VM per 
physical host, and you will need to talk to a persistent filestore for 
the data you want to retain. Running >1 VM per physical host just 
creates conflict for things like disk, ether and CPU that the virtual OS 
won't be aware of. Also, VM to disk performance is pretty bad right now, 
though that's improving.

Thanks & Regards

Adarsh Sharma
