hadoop-common-user mailing list archives

From Ted Dunning <tdunn...@veoh.com>
Subject Re: Performance issues with large map/reduce processes
Date Thu, 27 Dec 2007 15:36:49 GMT

Can you say a bit more about your processes?  Are they truly parallel maps
without any shared state?

Are you setting a sensible limit on the maximum number of maps and reduces per
machine?

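For reference, the per-node task limits live in hadoop-site.xml. A rough sketch (property names as in Hadoop configurations of this era; exact names vary by version, and the values here are only illustrative):

```xml
<!-- hadoop-site.xml: illustrative per-TaskTracker limits -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value> <!-- max concurrent map tasks per node -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value> <!-- max concurrent reduce tasks per node -->
</property>
```

Without limits like these, a node can be asked to run far more concurrent tasks than it has cores or memory for, and everything slows down together.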
How are you measuring these times?  Do they include shuffle time as well as
map time?  Do they include time before running?

What happens on the large size problems if you decrease the number of maps,
but keep input size constant?
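One way to try that experiment without changing the input: raise the minimum split size so Hadoop creates fewer, larger splits. A sketch (the property name is from the old mapred configuration; the value is illustrative):

```xml
<!-- illustrative: fewer, larger maps over the same input file -->
<property>
  <name>mapred.min.split.size</name>
  <value>67108864</value> <!-- 64 MB splits instead of ~16 MB -->
</property>
```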

Finally, why do you have so many reduces?  Usually it is good to have at
most a small multiple of the number of machines in your cluster.
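The usual rule of thumb is on the order of 0.95x to 1.75x the cluster-wide number of reduce slots, not one reduce per chunk of input. A tiny illustrative helper (the function name and slot counts are hypothetical, not from any Hadoop API):

```python
def recommended_reduces(num_nodes, reduce_slots_per_node, factor=0.95):
    """Hypothetical helper: a small multiple of cluster-wide reduce slots."""
    return int(factor * num_nodes * reduce_slots_per_node)

# e.g. a 70-node cluster with 2 reduce slots per node
print(recommended_reduces(70, 2))  # 133
```

By that yardstick, a 70-machine cluster wants on the order of 100-250 reduces, not 2000.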

On 12/26/07 2:52 PM, "jag123" <jaganrvce@yahoo.com> wrote:

> Hi,
> I am running a map/reduce task on a large cluster (70+ machines). I use a
> single input file, and sufficient number of map/reduce tasks so that each
> map process gets 250k records. That is, if my input file contains 1
> million records, I use 4 map and 4 reduce processes so that each map process
> gets 250k records.  Each map/reduce task usually takes 30 seconds to complete.
> A strange thing happens when I scale this problem:
> 1 million records, 4 map + 4 reduce ==> 30 seconds per map process
> 5 million records, 20 map + 20 reduce ==>  1 minute per map process
> 50 million records, 200 map + 200 reduce ==>  3 minutes per map process
> 500 million records, 2000 map + 2000 reduces ==> 45 minutes! per map process
> Note that in all the above cases, the map process performs the same amount
> of work (250k records).
> In all the cases, I use a single large input file. Hadoop breaks the file
> into ~16 MB chunks (about 250k records). Input format is
> TextInputFormat.class. I cannot think of any reason why this is happening.
> The task setup in all the above cases takes 30 seconds or so. But then the
> map process practically crawls. 
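For what it's worth, the map counts jag123 lists follow directly from the split arithmetic (a sketch, assuming the stated 250k records per ~16 MB split):

```python
import math

def num_map_tasks(total_records, records_per_split=250_000):
    # One map task per input split (~16 MB chunks in this case)
    return math.ceil(total_records / records_per_split)

for records in (1_000_000, 5_000_000, 50_000_000, 500_000_000):
    print(records, "->", num_map_tasks(records))
```

So the per-map workload really is constant across the four runs; only the total task count grows, which points at framework overhead (scheduling, shuffle fan-out) rather than the map function itself.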
