hadoop-mapreduce-user mailing list archives

From Gregory Lawrence <gr...@yahoo-inc.com>
Subject Re: Pipelining Mappers and Reducers
Date Tue, 27 Jul 2010 17:47:02 GMT

It's hard to determine what the best solution would be without knowing more about your problem.
In general, combiner functions work well but they will be of little value if each mapper output
contains a unique key. This is because combiner functions only "combine" multiple values associated
with the same key (e.g., counting the number of occurrences of a word). Another common approach
is to use two MapReduce jobs. The first job uses multiple reducers to do some of the processing;
here, you can hopefully shrink the data by generating, for example, sufficient statistics for
whatever ultimately needs to be computed. A second MapReduce job then takes the intermediate
output and produces the desired result.
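To make the distinction concrete, here is a minimal sketch of combiner semantics in plain Python (not the Hadoop Java API; the function names are illustrative). It shows why a combiner only helps when a single mapper emits the same key more than once:

```python
from collections import Counter

def map_words(line):
    """Map phase: emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Combiner: locally sum the values of duplicate keys within one
    mapper's output. This shrinks shuffle traffic only when a mapper
    emits the same key more than once."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return sorted(counts.items())

# One mapper's output before and after combining:
mapper_output = map_words("to be or not to be")
combined = combine(mapper_output)
# 6 pairs shrink to 4 because "to" and "be" each repeat; with
# all-unique keys the combiner would save nothing.
```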

As for your question about the Reducer processing map outputs as they become ready: I believe
that the copy stage may start before all Mappers are finished. However, the sorting and the
application of your reduce function cannot proceed until every Mapper has finished.
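For reference, a sketch of how early that copy stage starts is tuned (assuming the 0.20/1.x property name, `mapred.reduce.slowstart.completed.maps`, which sets the fraction of map tasks that must complete before reduce tasks are scheduled):

```xml
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <!-- schedule reducers once 5% of maps have completed -->
  <value>0.05</value>
</property>
```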

Could you describe your problem in more detail?

Greg Lawrence

On 7/27/10 4:06 AM, "Shai Erera" <serera@gmail.com> wrote:


I have a scenario for which I'd like to write an MR job in which Mappers do some work and eventually
the output of all Mappers needs to be combined by a single Reducer. Each Mapper outputs a <key,value>
pair with a key distinct from those of all other Mappers, meaning the Reducer.reduce() method always
receives a single element in the values argument for a given key. Really - the Mappers are independent
of each other in their output.

What would really be great for me is if I could have the Reducer start processing the map
outputs as they are ready, and not after all Mappers finish. For example, I'm processing a
very large data set and the MR framework spawns hundreds of Mappers for the task. The output
of all Mappers, though, must be processed by a single Reducer. It so happens that the Reducer's
work is very heavy compared to the Mappers': while all Mappers finish in about 7 minutes (total
wall-clock time), the Reducer takes ~30 minutes.

In my cluster I can run 96 Mappers in parallel, so I'm pretty sure that if I could stream the
outputs of the Mappers to the Reducer, I could gain some cycles back - I could easily limit the
number of Mappers to, say, 95 and keep the Reducer constantly busy.

I've read about chaining Mappers, but to the best of my understanding the second line of Mappers
will only start after the first ones have finished. Am I correct?

Someone also hinted to me that I could write a Combiner that Hadoop might invoke on the Reducer's
side when Mappers finish, if, say, the Mappers' data is very large and cannot be kept in RAM. I
haven't tried it yet, so if anyone can confirm this will indeed work, I'm willing to give it a
try. The output of the Mappers is very large, and therefore they already write it directly to
disk. So I'd like to avoid doing this serialization twice (once when the Mapper works, and a
second time when Hadoop *flushes* the Reducer's buffer - or whatever the right terminology is).

I apologize if this has been raised before - if it has, could you please point me at the relevant thread?

