hadoop-common-user mailing list archives

From Alex Kozlov <ale...@cloudera.com>
Subject Re: Seperate Server Sets for Map and Reduce
Date Tue, 20 Jul 2010 20:26:46 GMT
Hi RajVish,

I am just wondering why the reduce input is huge: would increasing the # of
reducers make each reducer's share smaller, or is it a 'fixed cost'?  Having
the reduce input size >> map input size definitely makes this a hard problem
to schedule on a homogeneous cluster, and it may also limit scalability.
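
If more reducers do help, the reducer count is just a job parameter,
mapred.reduce.tasks.  A minimal sketch of a per-job override (the value 40
is purely illustrative):

    <property>
      <name>mapred.reduce.tasks</name>
      <value>40</value>
    </property>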

Regarding your question, you can certainly force the map/reduce slot ratio
to be different on different nodes using
mapred.tasktracker.{map,reduce}.tasks.maximum, but this will have
implications for data locality and scalability.
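
For example, on the big-disk nodes you could configure the TaskTracker to
offer mostly reduce slots (a minimal sketch for mapred-site.xml; the slot
counts are illustrative, not recommendations):

    <!-- mapred-site.xml on a big-disk node: few map slots, more reduce slots -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>1</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>8</value>
    </property>

and the reverse on the small-disk nodes.  The TaskTrackers read these values
at startup, so they need a restart to pick up the change.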

Also, you may still end up with the same problem, since the mappers spill
their output to local disk and mapper output == reducer input.
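
As a side note, where that intermediate data lands is controlled by
mapred.local.dir, so you can at least point it at the largest volumes on
each node (a sketch; the paths are hypothetical):

    <property>
      <name>mapred.local.dir</name>
      <value>/data1/mapred/local,/data2/mapred/local</value>
    </property>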

Alex K

On Tue, Jul 20, 2010 at 12:11 PM, RajVish <rajvish@yahoo.com> wrote:

> We have lots of servers but a limited storage pool. My map jobs handle
> lots of small input files (approx. 300 MB compressed), but the reduce
> input is huge (about 100 GB), requiring lots of temporary and local
> storage. I would like to divide my server pool into two kinds: one set
> with small disks (for the map tasks) and a few with big storage (for the
> combine and reduce tasks).
> Is there something I can do that lets me force the reduce tasks to run on
> specific nodes?
> I have searched Google and some forums but found nothing.
> -best regards,
> Raj
> --
> View this message in context:
> http://old.nabble.com/Seperate-Server-Sets-for-Map-and-Reduce-tp29216327p29216327.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
