hadoop-common-dev mailing list archives

From "Michel Tourn" <mic...@yahoo-inc.com>
Subject Estimating number of worker nodes
Date Tue, 14 Feb 2006 02:57:52 GMT

Sometimes an MR application really wants to set an absolute number of tasks:
 num Reduce Tasks = 1 (so that the result is available in a single file)
The job submission API makes this case easy.

But at other times an MR application would rather
set a relative number of tasks:
 num Map Tasks = 5 * num active worker nodes
 num Reduce Tasks = 2 * num active worker nodes

Intuitively, it seems to be a good thing to consider
num-active-worker-nodes as a variable rather than
as a constant known by the MapRed user.

 -the cluster may be expanded/shrunk without the MapRed user being aware
of it.
 -the MapRed user may sometimes run tests on a smaller ('personal') MapRed
cluster.

To implement this, some component of the system needs
to know: "num active worker nodes"

The JobTracker knows num. active nodes (taskTrackers.size())
but the JobClient does not.

So I can see two possible ways to add this functionality:
(specifying a number of tasks relative to the cluster size)

1. JobTracker exposes num. active nodes to JobClient
  (via an extension to the JobSubmissionProtocol)
   Some client code connects to the JobTracker twice:
   once to learn num-workers
   once to submit a MapRed job, using setNumMapTasks( 5 * num-workers )

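A minimal sketch of option 1's client-side flow. The ClusterInfo stand-in and its getNumActiveWorkers() method are hypothetical; JobSubmissionProtocol does not expose the cluster size today, which is exactly the extension being proposed here.

```java
// Stand-in for the proposed JobSubmissionProtocol extension (hypothetical name).
interface ClusterInfo {
    int getNumActiveWorkers();  // would report taskTrackers.size() from the JobTracker
}

class RelativeSubmission {
    // First connection: learn the worker count. The caller then makes a second
    // connection to submit the job, passing this value to setNumMapTasks(...).
    static int mapTasksFor(ClusterInfo cluster, int mapTasksPerWorker) {
        int numWorkers = cluster.getNumActiveWorkers();
        return mapTasksPerWorker * numWorkers;
    }
}
```

For example, with 10 active TaskTrackers and 5 maps per worker, the client would end up calling setNumMapTasks(50).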
2. JobConf is extended to accept a relative number of tasks.
   Existing: setNumMapTasks(int n)               "mapred.map.tasks"
   New:      setNumMapTasksPerTaskTracker(int n)
   JobClient must set either value.
   Then when JobTracker accepts a job, it simply translates if necessary
    tasks = taskspertracker * num-workers.
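That translation step could be as simple as the following sketch. The resolver class and the convention that 0 means "unset" are assumptions for illustration, not existing API.

```java
// Hypothetical JobTracker-side translation for option 2.
class TaskCountResolver {
    // absoluteTasks:   value of setNumMapTasks, or 0 if unset
    // tasksPerTracker: value of the proposed setNumMapTasksPerTaskTracker, or 0 if unset
    // numWorkers:      taskTrackers.size(), known only to the JobTracker
    static int resolve(int absoluteTasks, int tasksPerTracker, int numWorkers) {
        // Translate only when the relative form was used; otherwise keep the
        // absolute count as-is.
        return tasksPerTracker > 0 ? tasksPerTracker * numWorkers : absoluteTasks;
    }
}
```

With this, existing jobs that set an absolute count pass through unchanged, while jobs that set the per-tracker value scale automatically with cluster size.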

Do we agree this is useful,
and which do you think is the better option? (1. or 2.)

