hadoop-common-user mailing list archives

From Steve Loughran <ste...@apache.org>
Subject Re: question about when shuffle/sort start working
Date Mon, 01 Jun 2009 14:31:14 GMT
Todd Lipcon wrote:
> Hi Jianmin,
> 
> This is not (currently) supported by Hadoop (or Google's MapReduce either
> afaik). What you're looking for sounds like something more like Microsoft's
> Dryad.
> 
> One thing that is supported in versions of Hadoop after 0.19 is JVM reuse.
> If you enable this feature, task trackers will persist JVMs between jobs.
> You can then persist some state in static variables.
> 
> I'd caution you, however, against making too much use of this fact as anything
> but an optimization. The reason that Hadoop is limited to MR (or M+RM* as
> you said) is that simplicity and reliability often go hand in hand. If you
> start maintaining important state in RAM on the tasktracker JVMs, and one of
> them goes down, you may need to restart your entire job sequence from the
> top. In typical MapReduce, you may need to rerun a mapper or a reducer, but
> the state is all on disk ready to go.
> 
> -Todd
> 


I'd thought the question was not necessarily one of maintaining state, 
but of chaining the output of one job into the next, where the number of 
iterations depends on the outcome of the previous round. Funnily enough, 
this is what you (apparently) end up having to do when implementing 
PageRank-like ranking as MR jobs:
http://skillsmatter.com/podcast/cloud-grid/having-fun-with-pagerank-and-mapreduce
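
To make the chaining idea concrete, here is a minimal sketch in plain Python (not the Hadoop API) of a driver loop that feeds each round's output into the next and decides whether to run another round by looking at the result. The function names, the damping factor of 0.85, the tolerance, and the toy graph are all assumptions for illustration; the sketch also assumes every node has at least one outgoing link.

```python
from collections import defaultdict

DAMPING = 0.85  # conventional PageRank damping factor (an assumption here)


def map_phase(graph, ranks):
    """Mapper stand-in: each node emits an equal share of its rank to its links."""
    for node, links in graph.items():
        share = ranks[node] / len(links)  # assumes no node has zero out-links
        for target in links:
            yield target, share


def reduce_phase(graph, pairs):
    """Reducer stand-in: sum the shares per node and apply the damping factor."""
    sums = defaultdict(float)
    for target, share in pairs:
        sums[target] += share
    n = len(graph)
    return {node: (1 - DAMPING) / n + DAMPING * sums[node] for node in graph}


def run_until_converged(graph, tolerance=1e-6, max_iterations=200):
    """Driver: run one 'job' per round, chaining output to input.

    The number of rounds is not known up front; it depends on how much
    the ranks moved in the previous round.
    """
    n = len(graph)
    ranks = {node: 1.0 / n for node in graph}
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        new_ranks = reduce_phase(graph, map_phase(graph, ranks))
        delta = sum(abs(new_ranks[node] - ranks[node]) for node in graph)
        ranks = new_ranks  # this round's output is the next round's input
        if delta < tolerance:
            break  # the outcome of this round says no further job is needed
    return ranks, iteration


if __name__ == "__main__":
    toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    final_ranks, rounds = run_until_converged(toy_graph)
    print(rounds, final_ranks)
```

In a real Hadoop setup each round would presumably be a separate job whose output directory becomes the input of the next, with the driver reading back some convergence measure between submissions; the loop above only mirrors that control flow in memory.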
