hadoop-mapreduce-user mailing list archives

From Shai Erera <ser...@gmail.com>
Subject Pipelining Mappers and Reducers
Date Tue, 27 Jul 2010 11:06:53 GMT
Hi

I have a scenario for which I'd like to write an MR job in which Mappers do
some work, and eventually the output of all Mappers needs to be combined by a
single Reducer. Each Mapper outputs a <key,value> pair whose key is distinct
from those of all other Mappers, meaning the Reducer.reduce() method always
receives a single element in the values argument for a given key. Really - the
Mappers are completely independent of each other in their output.
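
For clarity, here is roughly the setup I have in mind - a minimal sketch
using the old mapred API, where MyMapper and MyReducer stand in for my real
classes and the emitted keys/values are placeholder logic:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SingleReducerJob {

  // Stand-in for my real Mapper: the key here is just derived from the
  // record offset; my real Mapper guarantees globally distinct keys.
  public static class MyMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      output.collect(new Text("key-" + offset.get()), value);
    }
  }

  // Stand-in for my real (heavy) Reducer: because keys are distinct
  // across Mappers, 'values' holds exactly one element per key.
  public static class MyReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      while (values.hasNext()) {
        output.collect(key, values.next());
      }
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SingleReducerJob.class);
    conf.setJobName("single-reducer-job");

    conf.setMapperClass(MyMapper.class);
    conf.setReducerClass(MyReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    // All map outputs are funneled into one Reducer.
    conf.setNumReduceTasks(1);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}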

What would really be great for me is if I could have the Reducer start
processing map outputs as they become ready, rather than only after all
Mappers finish. For example, I'm processing a very large data set, and the MR
framework spawns hundreds of Mappers for the job. The output of all Mappers,
though, must be processed by a single Reducer. It so happens that the
Reducer's work is very heavy compared to the Mappers': while all Mappers
finish in about 7 minutes (total wall-clock time), the Reducer takes ~30
minutes.

In my cluster I can run 96 Mappers in parallel, so I'm pretty sure that if I
could stream the Mappers' outputs to the Reducer as they complete, I could
gain some cycles back - I could easily limit the number of Mappers to, say,
95, and keep the Reducer constantly busy.
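
The closest knob I've come across is the reduce slow-start setting, which, if
I understand it correctly, only controls when the Reducer starts copying map
outputs, not when reduce() itself begins running. A sketch, assuming the
property name from the 0.20 line:

// Schedule the (single) Reducer once 5% of the Mappers have completed,
// so it shuffles map outputs while the remaining Mappers are running.
// As far as I can tell, reduce() itself still waits for all map outputs.
JobConf conf = new JobConf(SingleReducerJob.class);
conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.05f);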

I've read about chaining Mappers, but to the best of my understanding the
second stage of Mappers will only start after the first ones have finished. Is
that correct?
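
To be concrete, this is the kind of chaining I've read about - a sketch using
ChainMapper from the old mapred API, where AMapper and BMapper are
hypothetical Mapper implementations of mine:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.ChainMapper;

public class MyChainJob {
  public static void configureChain(JobConf job) {
    // AMapper and BMapper are hypothetical. ChainMapper runs the added
    // Mappers in sequence *within* each map task: each record emitted by
    // AMapper is fed directly to BMapper.
    ChainMapper.addMapper(job, AMapper.class,
        LongWritable.class, Text.class,  // AMapper input key/value types
        Text.class, Text.class,          // AMapper output key/value types
        true, new JobConf(false));
    ChainMapper.addMapper(job, BMapper.class,
        Text.class, Text.class,          // BMapper consumes AMapper's output
        Text.class, Text.class,
        true, new JobConf(false));
  }
}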

Someone also hinted to me that I could write a Combiner that Hadoop might
invoke on the Reducer's side when Mappers finish, if, say, the Mappers' data
is very large and cannot be kept in RAM. I haven't tried it yet, so if anyone
can confirm this will indeed work, I'm willing to give it a try. The output of
the Mappers is very large, and therefore they already write it directly to
disk. I'd like to avoid doing this serialization twice (once while the Mapper
runs, and a second time when Hadoop *flushes* the Reducer's buffer - or
whatever the right terminology is).
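
If it really is that simple, the setup would presumably just be the following,
with MyCombiner being a hypothetical Reducer implementation whose input and
output types match the map output types:

// MyCombiner is hypothetical. A Combiner is declared like a Reducer;
// Hadoop *may* apply it to map outputs, and - if the hint is right -
// also on the reduce side when merging spills that don't fit in RAM.
conf.setCombinerClass(MyCombiner.class);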

I apologize if this has been raised before - if it has, could you please
point me to the relevant discussion/issue?

Shai
