hadoop-common-user mailing list archives

From "Stu Hood" <stuh...@webmail.us>
Subject Re: Tech Talk: Dryad
Date Fri, 09 Nov 2007 18:02:32 GMT
I did read the conclusion of the previous thread, which was that nobody "thought" that the
performance gains would be worth the added complexity. I simply think that if a patch is available,
the developer should be encouraged to submit it for review, since the topic has been discussed
so frequently.

I think our concept of "the" map/reduce primitive has been limited in scope to the capabilities
that Google described. There is no reason not to explore potentially beneficial additions
(even if Google didn't think they were worthwhile).

Yes, Dryad is more confusing, because it uses a more flexible primitive. I'm not suggesting
that Hadoop should be rewritten to use a DAG at its core, but we do already have the
o.a.h.m.jobcontrol.JobControl module, so _somebody_ must think the concept is useful.
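
For what it's worth, here is roughly what a two-job dependency looks like with
JobControl today. This is only a sketch: conf1 and conf2 stand for fully
configured JobConfs, with the second job reading the output path the first one
writes.

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.jobcontrol.Job;
    import org.apache.hadoop.mapred.jobcontrol.JobControl;

    public class TwoStageDag {
        // conf1 and conf2 are assumed to be fully configured JobConfs,
        // with conf2 reading the path that conf1 writes.
        public static void run(JobConf conf1, JobConf conf2) throws Exception {
            Job first = new Job(conf1);
            Job second = new Job(conf2);
            second.addDependingJob(first); // second starts only after first succeeds

            JobControl control = new JobControl("two-stage-dag");
            control.addJob(first);
            control.addJob(second);

            // JobControl is a Runnable: drive it from its own thread and poll.
            Thread driver = new Thread(control);
            driver.start();
            while (!control.allFinished()) {
                Thread.sleep(1000);
            }
            control.stop();
        }
    }

That is a DAG in the small: JobControl runs whichever jobs have their
dependencies satisfied, which is exactly the concept Dryad generalizes.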

Re: the side note: as the presenter explained, he uses a small example first to demonstrate
the linear speedup. Later (around the 32 minute mark) he moves on to an example of sorting
10TB on 1800 machines in ~12 minutes...


-----Original Message-----
From: Owen O'Malley <oom@yahoo-inc.com>
Sent: Friday, November 9, 2007 12:32pm
To: hadoop-user@lucene.apache.org
Subject: Re: Tech Talk: Dryad

On Nov 9, 2007, at 8:49 AM, Stu Hood wrote:

> Currently there is no sanctioned method of 'piping' the reduce
> output of one job directly into the map input of another (although
> it has been discussed: see the thread I linked before:
> http://www.nabble.com/Poly-reduce--tf4313116.html ).

Did you read the conclusion of the previous thread? The performance
gains in avoiding the second map input are trivial compared to the
gains in simplicity of having a single data path and re-execution
story. During a reasonably large job, roughly 98% of your maps are
reading data on the _same_ node. Once we put in rack locality, it
will be even better.
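
To be concrete, the "single data path" today is just two jobs chained through
HDFS, with the second job's maps re-reading the first job's reduce output. A
rough sketch against the current o.a.h.mapred API; the paths and class names
here are made up:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class ChainedJobs {
        public static void main(String[] args) throws Exception {
            // Hypothetical intermediate directory in HDFS.
            Path intermediate = new Path("/tmp/stage1-out");

            JobConf first = new JobConf(ChainedJobs.class);
            first.setJobName("stage1");
            first.setInputPath(new Path("/data/input"));
            first.setOutputPath(intermediate);
            // mapper/reducer classes omitted; the defaults are the identity classes
            JobClient.runJob(first); // blocks until stage 1 completes

            JobConf second = new JobConf(ChainedJobs.class);
            second.setJobName("stage2");
            // Stage 2's maps re-read stage 1's reduce output from HDFS --
            // this is the "second map input", and with locality those
            // reads are almost always local.
            second.setInputPath(intermediate);
            second.setOutputPath(new Path("/data/output"));
            JobClient.runJob(second);
        }
    }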

I'd much, much rather build the map/reduce primitive and support it
very well than add the additional complexity of any sort of
poly-reduce. I think it is very appropriate for systems like Pig to
include that kind of optimization, but it should not be part of the
base framework.

I watched the front of the Dryad talk and was struck by how complex
it quickly became. It does give the application writer a lot of
control, but doing the equivalent of a map/reduce sort with 100k maps
and 4k reduces, with automatic spill-over to disk during the shuffle,
seemed _really_ complicated.

On a side note, in the part of the talk that I watched, the scaling
graph went from 2 to 9 nodes. Hadoop's scaling graphs go to thousands
of nodes. Did they ever suggest later in the talk that it scales up higher?

-- Owen
