hadoop-common-user mailing list archives

From James Kennedy <james.kenn...@troove.net>
Subject Examples of chained MapReduce?
Date Fri, 22 Jun 2007 17:39:22 GMT
I was wondering if anyone knows of any examples of truly chained, truly 
distributed MapReduce jobs.

So far I've had trouble finding examples of MapReduce jobs that are
kicked off by some one-time process and that in turn kick off other
MapReduce jobs long after the initial driver process is dead.  This
would be more distributed and fault tolerant, since it removes the
dependency on a driver process.

I looked at the Nutch crawl code, for example, which iteratively builds
up a URL db using successive MapReduces up to a certain depth.  But this
is all done from within a for loop of a single process, even though each
individual MapReduce is distributed.
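
To make the pattern concrete, here is a minimal sketch of that kind of
single-process driver loop. It is not the Nutch code: run_mapreduce is a
hypothetical in-memory stand-in for submitting one job, and LINKS,
expand, dedupe and crawl are made-up names and data used only for
illustration.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Hypothetical in-memory stand-in for one distributed MapReduce job."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output

# Made-up link graph: page -> outgoing links.
LINKS = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}

def expand(url):
    # Map: emit the URL itself plus every URL it links to.
    yield url, None
    for target in LINKS.get(url, []):
        yield target, None

def dedupe(url, _values):
    # Reduce: keep each URL once.
    yield url

def crawl(seeds, depth):
    db = list(seeds)
    # The loop below lives in one driver process. If that process dies
    # between iterations, no further jobs are ever submitted.
    for _ in range(depth):
        db = run_mapreduce(db, expand, dedupe)
    return db
```

Each call to run_mapreduce could be fully distributed, but the decision
to start the next round is still made by the one process running crawl.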

Also, I notice that both Google's and Hadoop's examples of the
distributed sort fail to deal with the fact that the result is multiple
sorted files... this isn't a complete sort, since the output files still
need to be merge-sorted, don't they?  To complete the algorithm, could
the Reducer kick off a subsequent merge-sort MapReduce on the result
files?  Or maybe there's something I'm not understanding...
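
A small sketch of the concern, under a loudly stated assumption: keys
are assigned to reducers by a simple modulo, standing in for hash-style
partitioning. Whether the actual sort examples partition this way is
not something this sketch claims. The function names are hypothetical.

```python
import heapq

def partition(keys, num_reducers):
    # Assumption: modulo assignment stands in for hash partitioning.
    parts = [[] for _ in range(num_reducers)]
    for key in keys:
        parts[key % num_reducers].append(key)
    # Each reducer writes its own keys in sorted order.
    return [sorted(part) for part in parts]

def concatenate(parts):
    # Reading the output files one after another.
    return [key for part in parts for key in part]

def merge_sorted(parts):
    # The extra k-way merge step over the already-sorted files.
    return list(heapq.merge(*parts))

parts = partition([9, 4, 7, 1, 8, 2], 3)
print(parts)
print(concatenate(parts))
print(merge_sorted(parts))
```

Under that assumption each output file is sorted on its own, but reading
the files back to back does not give a sorted sequence; only the merge
step does.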
