hadoop-common-user mailing list archives

From Aaron Kimball <...@cs.washington.edu>
Subject Re: Tech Talk: Dryad
Date Fri, 09 Nov 2007 22:47:21 GMT
Is there a timeframe for when rack-locality will be available?
- Aaron

Owen O'Malley wrote:
>
> On Nov 9, 2007, at 8:49 AM, Stu Hood wrote:
>
>> Currently there is no sanctioned method of 'piping' the reduce output 
>> of one job directly into the map input of another (although it has 
>> been discussed: see the thread I linked before: 
>> http://www.nabble.com/Poly-reduce--tf4313116.html ).
>
> Did you read the conclusion of the previous thread? The performance 
> gains in avoiding the second map input are trivial compared to the 
> gains in simplicity of having a single data path and re-execution story. 
> During a reasonably large job, roughly 98% of your maps are reading 
> data on the _same_ node. Once we put in rack locality, it will be even 
> better.
>
> I'd much much rather build the map/reduce primitive and support it 
> very well than add the additional complexity of any sort of 
> poly-reduce. I think it is very appropriate for systems like Pig to 
> include that kind of optimization, but it should not be part of the 
> base framework.
>
> I watched the front of the Dryad talk and was struck by how complex it 
> quickly became. It does give the application writer a lot of control, 
> but to do the equivalent of a map/reduce sort with 100k maps and 4k 
> reduces with automatic spill-over to disk during the shuffle seemed 
> _really_ complicated.
>
> On a side note, in the part of the talk that I watched, the scaling 
> graph went from 2 to 9 nodes. Hadoop's scaling graphs go to 1000's of 
> nodes. Did they ever suggest later in the talk that it scales up higher?
>
> -- Owen
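
[Editor's note: the usual answer to the 'piping' question quoted above is simply to chain two separate jobs, pointing the second job's map input at the first job's reduce output directory on HDFS. Below is a minimal sketch, assuming the later org.apache.hadoop.mapreduce API (not the mapred API current when this thread was posted); the identity Mapper and Reducer stand in for real application classes.]

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedJobs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);
    Path intermediate = new Path(args[1]);  // reduce output of job 1 == map input of job 2
    Path output = new Path(args[2]);

    // Job 1: its reduce output lands in the intermediate directory on HDFS.
    Job first = Job.getInstance(conf, "first pass");
    first.setJarByClass(ChainedJobs.class);
    first.setMapperClass(Mapper.class);    // identity; a real job plugs in its own classes
    first.setReducerClass(Reducer.class);  // identity
    FileInputFormat.addInputPath(first, input);
    FileOutputFormat.setOutputPath(first, intermediate);
    if (!first.waitForCompletion(true)) {
      System.exit(1);
    }

    // Job 2: reads the first job's output files as ordinary map input.
    Job second = Job.getInstance(conf, "second pass");
    second.setJarByClass(ChainedJobs.class);
    second.setMapperClass(Mapper.class);
    second.setReducerClass(Reducer.class);
    FileInputFormat.addInputPath(second, intermediate);
    FileOutputFormat.setOutputPath(second, output);
    System.exit(second.waitForCompletion(true) ? 0 : 1);
  }
}

[Because the second job re-reads the intermediate data from HDFS, its maps are scheduled with the usual data locality, which is the point Owen makes above about the extra map input being a small cost. Higher-level tools like Pig generate exactly this kind of chained plan on top of the base framework.]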
