hadoop-user mailing list archives

Message view
From "Bejoy KS" <bejoy.had...@gmail.com>
Subject Re: guessing number of reducers.
Date Wed, 21 Nov 2012 18:21:09 GMT
Hi Andy

It is usually so because, if you have more reduce tasks than reduce slots in your cluster,
some of the reduce tasks will sit in the queue waiting for their turn. So it is better to keep
the number of reduce tasks slightly less than the reduce task capacity, so that all reduce tasks
run at once in parallel.

But in some cases each reducer can process only a certain volume of data due to some constraint;
for example, data beyond a certain limit may lead to OOMs. In such cases you may need to configure
the number of reducers entirely based on your data and not based on slots.
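The queueing effect described above can be sketched in plain Java (the helper name is invented here, not a Hadoop API): reduce tasks beyond the slot count run in extra "waves", each wave waiting for the previous one to free slots.

```java
// Sketch (invented helper, not a Hadoop API): when there are more reduce
// tasks than reduce slots, tasks run in sequential "waves"; tasks beyond
// the slot count queue up and wait for a free slot.
public class ReduceWaves {
    /** Number of sequential waves needed to run all reduce tasks. */
    static int waves(int reduceTasks, int reduceSlots) {
        return (reduceTasks + reduceSlots - 1) / reduceSlots; // ceiling division
    }
}
```

With 16 slots, 16 reducers finish in one wave, but 17 would take two, the second wave occupying a single slot while the rest sit idle; hence the advice to stay slightly under capacity.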

Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: "Kartashov, Andy" <Andy.Kartashov@mpac.ca>
Date: Wed, 21 Nov 2012 17:49:50 
To: user@hadoop.apache.org<user@hadoop.apache.org>; bejoy.hadoop@gmail.com<bejoy.hadoop@gmail.com>
Subject: RE: guessing number of reducers.


I've read somewhere about keeping the number of mapred.reduce.tasks below the reduce task capacity.
Here is what I just tested:

Output: 25 GB. 8-DataNode cluster with a Map and Reduce Task Capacity of 16 each:

1 Reducer   - 22 mins
4 Reducers  - 11.5 mins
8 Reducers  - 5 mins
10 Reducers - 7 mins
12 Reducers - 6.5 mins
16 Reducers - 5.5 mins

8 Reducers have won the race. But reducers at the max capacity were very close. :)


From: Bejoy KS [mailto:bejoy.hadoop@gmail.com]
Sent: Wednesday, November 21, 2012 11:51 AM
To: user@hadoop.apache.org
Subject: Re: guessing number of reducers.

Hi Sasha

In general the number of reduce tasks is chosen mainly based on the data volume entering the reduce
phase. In tools like Hive and Pig, by default there is one reducer for every 1 GB of map output;
so if you have 100 gigs of map output, you get 100 reducers.
If your tasks are more CPU intensive, then you need a smaller volume of data per reducer for better
performance results.

In general it is better to have the number of reduce tasks slightly less than the number of
available reduce slots in the cluster.
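The 1 GB rule of thumb above, as a small sketch in plain Java (the 1 GB figure is the Hive/Pig default mentioned above and is tunable; the class and method names are invented for illustration):

```java
// Sketch of the rule of thumb: one reducer per ~1 GB of map output.
public class ReducerEstimate {
    // Assumed default from the rule above; tune for CPU-heavy reduce work.
    static final long BYTES_PER_REDUCER = 1L << 30; // ~1 GB

    /** One reducer per BYTES_PER_REDUCER of map output, rounded up, at least one. */
    static int estimate(long mapOutputBytes) {
        return (int) Math.max(1,
                (mapOutputBytes + BYTES_PER_REDUCER - 1) / BYTES_PER_REDUCER);
    }
}
```

For 100 GB of map output this yields 100 reducers, matching the example above; in practice you would then cap the result slightly below the cluster's reduce slot count.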
Bejoy KS

Sent from handheld, please excuse typos.
From: jamal sasha <jamalshasha@gmail.com>
Date: Wed, 21 Nov 2012 11:38:38 -0500
To: user@hadoop.apache.org<user@hadoop.apache.org>
ReplyTo: user@hadoop.apache.org
Subject: guessing number of reducers.

By default the number of reducers is set to 1.
Is there a good way to guess the optimal number of reducers?
Let's say I have TBs worth of data, and mappers on the order of 5000 or so.
But ultimately I am calculating, let's say, some average of the whole data... say an average transaction.
Now the output will be just one line in one "part" file; the rest of them will be empty. So I am guessing
I need loads of reducers, but then most of them will be empty, yet at the same time one reducer
won't suffice.
What's the best way to solve this?
How do I guess the optimal number of reducers?
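For the global-average case described above, a combiner that emits partial (sum, count) pairs lets a single reducer suffice, since each map task then contributes one small record rather than its raw values. A plain-Java sketch of that pattern (no Hadoop types; all names invented here):

```java
// Sketch of combiner-style pre-aggregation for a global average:
// each map task collapses its values into one (sum, count) pair, so the
// lone reducer merges ~one record per map task, not terabytes of data.
import java.util.List;

public class AverageCombine {
    static class Partial {
        final double sum;
        final long count;
        Partial(double sum, long count) { this.sum = sum; this.count = count; }
    }

    /** Combiner side: collapse one map task's values into a partial. */
    static Partial combine(double[] values) {
        double s = 0;
        for (double v : values) s += v;
        return new Partial(s, values.length);
    }

    /** Reducer side: merge all partials into the global average. */
    static double reduce(List<Partial> partials) {
        double sum = 0;
        long count = 0;
        for (Partial p : partials) { sum += p.sum; count += p.count; }
        return sum / count;
    }
}
```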
