hadoop-common-user mailing list archives

From "Stu Hood" <stuh...@webmail.us>
Subject Re: Tech Talk: Dryad
Date Fri, 09 Nov 2007 16:49:23 GMT
Sorry, I just noticed that I mistyped... Meant to say:
> direct reduce->map links.

Currently there is no sanctioned method of 'piping' the reduce output of one job directly
into the map input of another (although it has been discussed; see the thread I linked before:
http://www.nabble.com/Poly-reduce--tf4313116.html ).
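
For illustration only, here is a toy single-process model of what chaining two jobs looks like today: the first job's reduce output is fully materialized before the second job's map reads it. Nothing here is Hadoop API; `run_job`, the word-count functions, and the in-memory `intermediate` list (standing in for output written to the distributed filesystem) are all assumptions made up for this sketch.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Toy MapReduce: map every record, group values by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):  # stands in for the sort/shuffle phase
        output.extend(reducer(key, groups[key]))
    return output


# Job 1 (hypothetical): count words.
def wc_map(line):
    for word in line.split():
        yield word, 1


def wc_reduce(word, counts):
    yield word, sum(counts)


# Job 2 (hypothetical): group words by their count.
def by_count_map(pair):
    word, count = pair
    yield count, word


def by_count_reduce(count, words):
    yield count, sorted(words)


lines = ["a b a", "b c a"]
# Job 1's reduce output is materialized in full before job 2 starts;
# in a real cluster this would be a write to, and a read from, the DFS.
intermediate = run_job(lines, wc_map, wc_reduce)
result = run_job(intermediate, by_count_map, by_count_reduce)
```

A direct reduce->map link would let the second job's map consume records as the first job's reduce emits them, skipping the materialized `intermediate` step.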

The main focus of Hadoop is large clusters and long jobs, but an _optional_ method of job
composition that increases speed at the cost of a higher chance of failure would definitely
be worthwhile for small-to-medium clusters and short jobs.

Perhaps Vuk Ercegovac could be convinced to submit the patch he mentioned in the thread I
linked.

Thanks,
Stu



-----Original Message-----
From: Doug Cutting <cutting@apache.org>
Sent: Friday, November 9, 2007 11:20am
To: hadoop-user@lucene.apache.org
Subject: Re: Tech Talk: Dryad

Stu Hood wrote:
> The slide comparing the time taken to spill to disk between vertices vs operating purely
> in memory (around minute 26) is definitely something to think about.

I have not had a chance to watch the video yet, but, in MapReduce, if 
the intermediate dataset is larger than the RAM on your cluster, then 
you must spill to disk in order to sort.  (When it is smaller, then we 
should of course avoid disk, but that's not the typical case.)  If you 
don't sort, then it's just map, and piping a sequence of maps together 
is trivial to do on the same host, no need to even move the data over 
the wire.  So I don't yet see the direct relevance.  What am I missing? 
  (Maybe I should watch the video...)
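
As a rough sketch of the "just map" case (plain Python generators, not Hadoop code; `apply_map`, `chain_maps`, and the three example mappers are hypothetical names): when no sort is needed, each record can flow through every map stage in turn on one host, so the chain is ordinary function composition and nothing is buffered between stages.

```python
def apply_map(mapper, records):
    """Run one record-at-a-time map stage lazily over a stream of records."""
    for record in records:
        yield from mapper(record)


def chain_maps(*mappers):
    """Compose map stages into one streaming pipeline on the same host."""
    def chained(records):
        stream = records
        for mapper in mappers:
            # Each stage wraps the previous one; no intermediate dataset is built.
            stream = apply_map(mapper, stream)
        return stream
    return chained


# Three hypothetical map stages.
def tokenize(line):
    yield from line.split()


def lowercase(word):
    yield word.lower()


def drop_short(word):
    if len(word) > 2:
        yield word


pipeline = chain_maps(tokenize, lowercase, drop_short)
words = list(pipeline(["The Dryad talk", "is on video"]))
```

A sort cannot be streamed this way, because it has to see every record before it can emit the first one; that is where the spill to disk comes in once the data outgrows RAM.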

Doug


