hadoop-mapreduce-user mailing list archives

From Amogh Vasekar <am...@yahoo-inc.com>
Subject Re: Pipelining Mappers and Reducers
Date Tue, 27 Jul 2010 11:34:27 GMT
Hi,
>>What would really be great for me is if I could have the Reducer start processing
the map outputs as they are ready, and not after all Mappers finish
Check the property mapred.reduce.slowstart.completed.maps, which controls the fraction of map tasks that must complete before the framework schedules the reducers.
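For illustration, a job could lower that threshold so reducers are launched once only a small fraction of maps have finished (the 0.05 below is just an example value, not a recommendation):

```xml
<!-- mapred-site.xml or per-job configuration: schedule reducers
     once 5% of the map tasks have completed -->
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.05</value>
</property>
```

Note that reducers started early only begin shuffling map outputs; the reduce() calls themselves still cannot run until every map has finished.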

>>I've read about chaining mappers, but to the best of my understanding the second line
of Mappers will only start after the first ones finished. Am I correct?
Not exactly; with chained mappers all the map transformations run in a single pass over each record until the reduce barrier is reached.
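To see what "one pass until the reduce barrier" means, here is a minimal plain-Java sketch (no Hadoop, the two functions are hypothetical stand-ins for chained map stages): each record flows through both mappers before the next record is read, so the second mapper does not wait for the first map phase to finish.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class ChainSketch {
    public static void main(String[] args) {
        // Two "mappers" chained: each record passes through both
        // in one go; there is no barrier between the two stages.
        Function<String, String> mapper1 = String::toLowerCase;
        Function<String, Integer> mapper2 = String::length;
        Function<String, Integer> chain = mapper1.andThen(mapper2);

        List<Integer> out = List.of("Foo", "Barbaz").stream()
                .map(chain)          // one pass per record
                .collect(Collectors.toList());
        System.out.println(out);     // [3, 6]
    }
}
```

The only barrier in the job is the shuffle before the reducer, which is why chaining mappers does not by itself let the Reducer start earlier.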

>>Someone also hinted to me that I could write a Combiner that Hadoop might invoke on
the Reducer's side when Mappers finish,
Combiners can run on both the map side and the reduce side as soon as the spill buffer fills (many configuration properties control this), and they work well when your reduce operation is associative and commutative, which is not the case for an operation like a plain average.
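A small plain-Java sketch of why an average makes a poor combiner: combining partial averages does not reproduce the global average, because each partial average loses its count.

```java
import java.util.Arrays;

public class AverageCombinerPitfall {
    static double avg(double[] xs) {
        return Arrays.stream(xs).average().orElse(0.0);
    }

    public static void main(String[] args) {
        double[] split1 = {1, 2};        // one mapper's values
        double[] split2 = {3, 4, 5};     // another mapper's values

        double avgOfAvgs = (avg(split1) + avg(split2)) / 2;   // 2.75
        double trueAvg   = avg(new double[]{1, 2, 3, 4, 5});  // 3.0

        System.out.println(avgOfAvgs == trueAvg);  // false
    }
}
```

The usual workaround is to have the combiner emit (sum, count) pairs, which are associative, and compute sum/count only in the reducer.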

HTH,
Amogh

On 7/27/10 4:36 PM, "Shai Erera" <serera@gmail.com> wrote:

Hi

I have a scenario for which I'd like to write an MR job in which Mappers do some work and eventually the output of all Mappers needs to be combined by a single Reducer. Each Mapper outputs a <key,value> pair that is distinct from those of all other Mappers, meaning the Reducer.reduce() method always receives a single element in the values argument for a specific key. Really, the Mappers are independent of each other in their output.

What would really be great for me is if the Reducer could start processing the map outputs as they become ready, and not only after all Mappers finish. For example, I'm processing a very large data set and the MR framework spawns hundreds of Mappers for the task. The output of all Mappers, though, must be processed by a single Reducer. It so happens that the Reducer's job is very heavy compared to the Mappers': while all Mappers finish in about 7 minutes (total wall-clock time), the Reducer takes ~30 minutes.

In my cluster I can run 96 Mappers in parallel, so I'm pretty sure that if I could stream the outputs of the Mappers to the Reducer, I could gain some cycles back. I could easily limit the number of Mappers to, say, 95 and keep the Reducer constantly busy.

I've read about chaining mappers, but to the best of my understanding the second line of Mappers will only start after the first ones have finished. Am I correct?

Someone also hinted that I could write a Combiner that Hadoop might invoke on the Reducer's side when Mappers finish, if, say, the Mappers' data is very large and cannot be kept in RAM. I haven't tried it yet, so if anyone can confirm this will indeed work, I'm willing to give it a try. The output of the Mappers is very large, and therefore they already write it directly to disk. So I'd like to avoid doing this serialization twice (once when the Mapper works, and a second time when Hadoop *flushes* the Reducer's buffer, or whatever the right terminology is).

I apologize if this has been raised before - if it has, could you please point me at the relevant
discussion/issue?

Shai

