hadoop-mapreduce-user mailing list archives

From Wellington Chevreuil <wellington.chevre...@gmail.com>
Subject Re: Hadoop map reduce merge algorithm
Date Thu, 12 Jan 2012 19:57:35 GMT
Intermediate data from the map phase is written to local disk by each Mapper.
That data is then transferred to the Reducer(s), which perform three
steps:
  - shuffle: each Reducer fetches, over the network, the map output
partitions assigned to it;
  - sort: the fetched map outputs are merge-sorted and grouped by key;
this happens concurrently with the shuffle as the data arrives;
  - reduce: the reduce() method of the Reducer class is called once per
key, receiving that key and all of its associated values (a minimal
sketch of such a Reducer is below);
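
Just as an illustration (not from the thread, only a minimal word-count-style
sketch using the org.apache.hadoop.mapreduce API): reduce() is invoked once
per key with all the values that the shuffle/sort phases grouped under it.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // All values grouped under this key by shuffle/sort arrive as one Iterable.
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}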

All the intermediate data assigned to a given intermediate key is sent to
the same Reducer (during the shuffle and sort phases), and the developer
can set the number of Reducers to be run in a given MR job via
Job.setNumReduceTasks(int). Increasing this number tends to help balance
the load across reducers. You can find more info at
http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html
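
For reference, a minimal driver sketch (class names and paths are
hypothetical; TokenizerMapper is assumed to be a Mapper defined elsewhere)
showing where setNumReduceTasks(int) fits:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");  // Job.getInstance(conf, ...) on newer releases
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);  // assumed Mapper, not shown here
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Each intermediate key goes to exactly one of these reducers;
        // raising the count can spread reduce work more evenly.
        job.setNumReduceTasks(4);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}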

Regards,
Wellington.

2012/1/12 Bai Shen <baishen.lists@gmail.com>:
> Can someone explain how the map reduce merge is done?  As far as I can tell,
> it appears to pull all of the spill files into one giant file to send to the
> reducer.  Is this correct?  Even if you set smaller spill files and a lower
> sort factor, the eventual merge is still the same.  It just takes more
> passes to get there.
>
> Thanks.
