hadoop-common-dev mailing list archives

From: Steve Loughran <ste...@apache.org>
Subject: Re: Adding Elasticity to Hadoop MapReduce
Date: Thu, 15 Sep 2011 09:23:58 GMT
On 14/09/11 22:20, Ted Dunning wrote:
> This makes a bit of sense, but you have to worry about the inertia of the
> data.  Adding compute resources is easy.  Adding data resources, not so
> much.

I've done it. Like Ted says, pure compute nodes generate more network 
traffic on both reads and writes; if you bring up Datanodes, then you 
have to leave them around. The strength is that the infrastructure can 
sell them to you for a lower $/hour with the condition that they can 
take them back when demand gets high; these compute-only nodes would be 
infrastructure-pre-emptible. Look for the presentation "farming hadoop 
in the cloud" for more details, though I don't discuss pre-emption there.
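
To make the economics concrete, here's a toy calculation; every rate 
and duration in it is invented, not from any real provider's tariff. 
The point is just that a node holding HDFS blocks gets billed all 
month, while a compute-only node is billed only for the burst:

  # Toy illustration of the pre-emptible compute-node economics.
  # All rates and durations below are made up for the example.

  FULL_RATE = 0.10         # $/hour, node you keep up (hosts a Datanode)
  PREEMPTIBLE_RATE = 0.03  # $/hour, compute-only node they may reclaim

  HOURS_IN_MONTH = 730
  BURST_HOURS = 40         # hours/month the extra capacity is needed

  # A Datanode holds blocks, so once up it stays up all month.
  datanode_cost = FULL_RATE * HOURS_IN_MONTH          # $73.00

  # A compute-only node holds no data, so you rent it for the burst
  # and hand it back (or lose it to pre-emption) afterwards.
  compute_only_cost = PREEMPTIBLE_RATE * BURST_HOURS  # $1.20

  print("persistent datanode:  $%.2f/month" % datanode_cost)
  print("pre-emptible compute: $%.2f/month" % compute_only_cost)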

> if the computation is not near the data, then it is likely to be
> much less effective.

Which implies your infrastructure needs to be data-aware and know to 
bring up the new VMs on the same racks as the HDFS nodes.
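
Hadoop's stock hook for this is the rack-awareness script wired in 
through the topology.script.file.name property: the namenode hands it 
addresses and it prints one rack path per address, and a placement 
service could drive VM allocation off the same table. A minimal 
sketch, with an invented rack table:

  #!/usr/bin/env python
  # Minimal Hadoop topology script: maps node IPs to rack paths.
  # Configured via topology.script.file.name in core-site.xml.
  # The rack table is invented for the example.
  import sys

  RACKS = {
      "10.0.1.11": "/dc1/rack1",
      "10.0.1.12": "/dc1/rack1",
      "10.0.2.21": "/dc1/rack2",
  }

  # Hadoop passes one or more addresses as arguments; emit one
  # rack path per address, with a default for unknown hosts.
  for addr in sys.argv[1:]:
      print(RACKS.get(addr, "/default-rack"))

An elasticity layer that reads the same mapping can then ask the IaaS 
to put new compute VMs under the /dc1/rackN paths that already hold 
the blocks.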

There are some infrastructure-aware runtimes -I am thinking of the 
Technical University of Berlin's Stratosphere project- which take a job 
at a higher level than MR commands -more like Pig or Hive- and come up 
with an execution plan that can schedule for optimal acquisition and 
re-use of VMs, knowing the cost of machines, the hysteresis of VM 
setup/teardown, and the hourly billing. You can then impose policies 
like "fast execution" or "lowest cost", and have a different plan 
generated for each.

This is all PhD-grade research with a group of highly skilled postgrads 
led by a professor who has worked on VLDBs; I would not attempt to 
retrofit this into Hadoop on my own. That said, if you want a PhD, you 
could contact that team or the UC Irvine people working on Algebricks 
and Hyracks and convince them that you should join their teams.

