hadoop-common-dev mailing list archives
From Steve Loughran <ste...@apache.org>
Subject Re: Adding Elasticity to Hadoop MapReduce
Date Thu, 15 Sep 2011 09:15:54 GMT
On 15/09/11 02:01, Bharath Ravi wrote:
> Thanks a lot, all!
>
> An end goal of mine was to make Hadoop as flexible as possible.
> Along the same lines, though unrelated to the above, is another idea I
> encountered,
> courtesy http://hadoopblog.blogspot.com/2010/11/hadoop-research-topics.html
>
> The blog mentions the ability to dynamically append Input.
> Specifically, can I append input to the Map and Reduce tasks after they've
> been started?

Dhruba is referring to something that they've actually implemented in 
their version of Hive, which is the ability to gradually increase the 
data input to a running Hive job.

This lets them do a query like "find 8 friends in california" without 
searching the entire dataset; pick a subset, search that, and if there 
are enough results, stop. If not: feed in some more data.
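The "search a subset, stop early, otherwise feed in more" loop above can be sketched in plain Java. This is a hypothetical illustration, not the actual Facebook/Hive code: each "block" is just an in-memory list standing in for an HDFS block, and the batch-doubling policy is an assumption.

```java
// Hypothetical sketch of incremental input: scan blocks in growing batches
// and bail out as soon as enough matching results have been collected.
import java.util.ArrayList;
import java.util.List;

public class IncrementalSearch {
    // blocks stands in for HDFS blocks; a real job would read splits instead.
    static List<String> search(List<List<String>> blocks, String term, int wanted) {
        List<String> results = new ArrayList<>();
        int batch = 2;          // initial subset size (assumed policy)
        int next = 0;
        while (next < blocks.size() && results.size() < wanted) {
            int end = Math.min(next + batch, blocks.size());
            for (int i = next; i < end; i++) {
                for (String rec : blocks.get(i)) {
                    if (rec.contains(term) && results.size() < wanted) {
                        results.add(rec);
                    }
                }
            }
            next = end;
            batch *= 2;         // not enough results yet: feed in more data
        }
        return results;
    }

    public static void main(String[] args) {
        List<List<String>> blocks = List.of(
            List.of("alice:california", "bob:oregon"),
            List.of("carol:california"),
            List.of("dave:california", "erin:california"));
        // Stops after the third match without touching the rest of the data.
        System.out.println(search(blocks, "california", 3));
    }
}
```

With unskewed data the loop usually terminates after a small prefix of the blocks; with skewed data it may end up walking most of them, which matches the full-scan comparison below.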

I have a paper on it showing that for data with little or no skew this is 
much faster than a full scan; for skewed data, where all the results sit 
in a small subset of blocks, it is about the same as a full scan -it 
depends on which blocks the results are found in.

> I haven't been able to find something like this at a cursory glance, but
> could someone
> advise me on this before I dig deeper?
>
> 1. Does such functionality exist, or is it being attempted?

It exists for Hive, though not in trunk; getting it in there would be 
mostly a matter of taking the existing code and slotting it in.

> 2. I would assume most cases would simply require starting a second Job for
> the new input.

No, because that loses all existing work and requires rescheduling more 
work. The goal of this is to execute one job that can bail out early.

The Facebook code runs with Hive; for classic MR jobs the first step 
would be to allow Map tasks to finish early. I think there may be a way 
to do that, and I plan to run some experiments to see if I'm right.

What would be more dramatic would be for the JT to be aware that jobs 
may finish early, and to slowly ramp up the map operations until either 
the tasks set some "finished" flag (presumably a shared counter) or the 
entire dataset has been processed, so the job still completes if the 
early finish doesn't happen. The scheduler could take this slow-start 
into account, knowing that the job's initial resource needs are quite 
low but may increase.
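The slow-start idea can be simulated in a few lines of plain Java. Again a hypothetical sketch, not JT code: the wave-doubling schedule and the results-per-map model are assumptions, and the shared "finished" counter is stood in for by an AtomicBoolean.

```java
// Hypothetical simulation of slow-start scheduling: launch map tasks in
// growing waves and stop scheduling new ones once a shared "finished"
// flag (standing in for a shared counter) has been set.
import java.util.concurrent.atomic.AtomicBoolean;

public class SlowStartScheduler {
    // Returns how many of totalMaps actually get scheduled before bail-out.
    static int run(int totalMaps, int neededResults, int resultsPerMap) {
        AtomicBoolean finished = new AtomicBoolean(false);
        int scheduled = 0, results = 0, wave = 1;
        while (scheduled < totalMaps && !finished.get()) {
            int launch = Math.min(wave, totalMaps - scheduled);
            for (int i = 0; i < launch; i++) {
                results += resultsPerMap;   // each map contributes some results
                if (results >= neededResults) {
                    finished.set(true);     // job signals it can stop early
                }
            }
            scheduled += launch;
            wave *= 2;                      // ramp resources up slowly
        }
        return scheduled;
    }

    public static void main(String[] args) {
        // 100 maps available, 8 results wanted at 1 result per map: the job
        // bails out after a small fraction of the maps ever run.
        System.out.println(run(100, 8, 1));
    }
}
```

The point of the simulation is the resource profile: the job starts cheap, grows only while the flag stays unset, and degrades to a full scan (all maps scheduled) only when the early finish never fires.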

> However, are there practical use cases to such a feature?

See above

> 3. Are there any other ideas on such "flexibility" of the system that I
> could contribute to?

While it's great that you want to do big things in Hadoop, I'd recommend 
you start by using it and learning your way around the codebase, 
especially SVN trunk or the unreleased 0.23 branch, as that is where all 
major changes will go, and the MR engine there has been radically 
reworked for better scheduling.

Start writing MR jobs that work under the new engine, using existing 
public datasets, or look at the layers above, then think how things 
could be improved.
