Author: yhemanth Date: Tue May 5 08:42:35 2009 New Revision: 771622 URL: http://svn.apache.org/viewvc?rev=771622&view=rev Log: HADOOP-5736. Update the capacity scheduler documentation for features like memory based scheduling, job initialization and removal of pre-emption. Contributed by Sreekanth Ramakrishnan. Modified: hadoop/core/trunk/CHANGES.txt hadoop/core/trunk/conf/capacity-scheduler.xml.template hadoop/core/trunk/src/docs/src/documentation/content/xdocs/capacity_scheduler.xml hadoop/core/trunk/src/docs/src/documentation/content/xdocs/cluster_setup.xml hadoop/core/trunk/src/docs/src/documentation/content/xdocs/mapred_tutorial.xml Modified: hadoop/core/trunk/CHANGES.txt URL: http://svn.apache.org/viewvc/hadoop/core/trunk/CHANGES.txt?rev=771622&r1=771621&r2=771622&view=diff ============================================================================== --- hadoop/core/trunk/CHANGES.txt (original) +++ hadoop/core/trunk/CHANGES.txt Tue May 5 08:42:35 2009 @@ -536,6 +536,10 @@ HADOOP-5711. Change Namenode file close log to info. (szetszwo) + HADOOP-5736. Update the capacity scheduler documentation for features + like memory based scheduling, job initialization and removal of pre-emption. + (Sreekanth Ramakrishnan via yhemanth) + OPTIMIZATIONS BUG FIXES Modified: hadoop/core/trunk/conf/capacity-scheduler.xml.template URL: http://svn.apache.org/viewvc/hadoop/core/trunk/conf/capacity-scheduler.xml.template?rev=771622&r1=771621&r2=771622&view=diff ============================================================================== --- hadoop/core/trunk/conf/capacity-scheduler.xml.template (original) +++ hadoop/core/trunk/conf/capacity-scheduler.xml.template Tue May 5 08:42:35 2009 @@ -60,16 +60,13 @@ mapred.capacity-scheduler.task.default-pmem-percentage-in-vmem -1 - If mapred.task.maxpmem is set to -1, this configuration will - be used to calculate job's physical memory requirements as a percentage of - the job's virtual memory requirements set via mapred.task.maxvmem. This - property thus provides default value of physical memory for job's that - don't explicitly specify physical memory requirements. - - If not explicitly set to a valid value, scheduler will not consider - physical memory for scheduling even if virtual memory based scheduling is - enabled(by setting valid values for both mapred.task.default.maxvmem and - mapred.task.limit.maxvmem). + A percentage (float) of the default VM limit for jobs + (mapred.task.default.maxvm). This is the default RAM task-limit + associated with a task. Unless overridden by a job's setting, this + number defines the RAM task-limit. + + If this property is missing, or set to an invalid value, scheduling + based on physical memory, RAM, is disabled. Modified: hadoop/core/trunk/src/docs/src/documentation/content/xdocs/capacity_scheduler.xml URL: http://svn.apache.org/viewvc/hadoop/core/trunk/src/docs/src/documentation/content/xdocs/capacity_scheduler.xml?rev=771622&r1=771621&r2=771622&view=diff ============================================================================== --- hadoop/core/trunk/src/docs/src/documentation/content/xdocs/capacity_scheduler.xml (original) +++ hadoop/core/trunk/src/docs/src/documentation/content/xdocs/capacity_scheduler.xml Tue May 5 08:42:35 2009 @@ -29,7 +29,9 @@
Purpose -

This document describes the Capacity Scheduler, a pluggable Map/Reduce scheduler for Hadoop which provides a way to share large clusters.

+

This document describes the Capacity Scheduler, a pluggable + Map/Reduce scheduler for Hadoop which provides a way to share + large clusters.

@@ -41,19 +43,17 @@ Support for multiple queues, where a job is submitted to a queue.
  • - Queues are guaranteed a fraction of the capacity of the grid (their - 'guaranteed capacity') in the sense that a certain capacity of - resources will be at their disposal. All jobs submitted to a - queue will have access to the capacity guaranteed to the queue. + Queues are allocated a fraction of the capacity of the grid in the + sense that a certain capacity of resources will be at their + disposal. All jobs submitted to a queue will have access to the + capacity allocated to the queue.
  • - Free resources can be allocated to any queue beyond its guaranteed - capacity. These excess allocated resources can be reclaimed and made - available to another queue in order to meet its capacity guarantee. -
  • -
  • - The scheduler guarantees that excess resources taken from a queue - will be restored to it within N minutes of its need for them. + Free resources can be allocated to any queue beyond it's capacity. + When there is demand for these resources from queues running below + capacity at a future point in time, as tasks scheduled on these + resources complete, they will be assigned to jobs on queues + running below the capacity.
  • Queues optionally support job priorities (disabled by default). @@ -61,7 +61,9 @@
  • Within a queue, jobs with higher priority will have access to the queue's resources before jobs with lower priority. However, once a - job is running, it will not be preempted for a higher priority job. + job is running, it will not be preempted for a higher priority job, + though new tasks from the higher priority job will be + preferentially scheduled.
  • In order to prevent one or more users from monopolizing its @@ -84,59 +86,34 @@

    Note that many of these steps can be, and will be, enhanced over time to provide better algorithms.

    -

    Whenever a TaskTracker is free, the Capacity Scheduler first picks a - queue that needs to reclaim any resources the earliest (this is a queue - whose resources were temporarily being used by some other queue and now - needs access to those resources). If no such queue is found, it then picks +

    Whenever a TaskTracker is free, the Capacity Scheduler picks a queue which has most free space (whose ratio of # of running slots to - guaranteed capacity is the lowest).

    + capacity is the lowest).

    -

    Once a queue is selected, the scheduler picks a job in the queue. Jobs +

    Once a queue is selected, the Scheduler picks a job in the queue. Jobs are sorted based on when they're submitted and their priorities (if the queue supports priorities). Jobs are considered in order, and a job is selected if its user is within the user-quota for the queue, i.e., the user is not already using queue resources above his/her limit. The - scheduler also makes sure that there is enough free memory in the + Scheduler also makes sure that there is enough free memory in the TaskTracker to tun the job's task, in case the job has special memory requirements.

    -

    Once a job is selected, the scheduler picks a task to run. This logic +

    Once a job is selected, the Scheduler picks a task to run. This logic to pick a task remains unchanged from earlier versions.

  • - Reclaiming capacity - -

    Periodically, the scheduler determines:

    - - -
    - -
    Installation -

    The capacity scheduler is available as a JAR file in the Hadoop +

    The Capacity Scheduler is available as a JAR file in the Hadoop tarball under the contrib/capacity-scheduler directory. The name of the JAR file would be on the lines of hadoop-*-capacity-scheduler.jar.

    -

    You can also build the scheduler from source by executing +

    You can also build the Scheduler from source by executing ant package, in which case it would be available under build/contrib/capacity-scheduler.

    -

    To run the capacity scheduler in your Hadoop installation, you need +

    To run the Capacity Scheduler in your Hadoop installation, you need to put it on the CLASSPATH. The easiest way is to copy the hadoop-*-capacity-scheduler.jar from to HADOOP_HOME/lib. Alternatively, you can modify @@ -148,9 +125,9 @@ Configuration

    - Using the capacity scheduler + Using the Capacity Scheduler

    - To make the Hadoop framework use the capacity scheduler, set up + To make the Hadoop framework use the Capacity Scheduler, set up the following property in the site configuration:

    @@ -168,7 +145,7 @@ Setting up queues

    You can define multiple queues to which users can submit jobs with - the capacity scheduler. To define multiple queues, you should edit + the Capacity Scheduler. To define multiple queues, you should edit the site configuration for Hadoop and modify the mapred.queue.names property.

    @@ -186,8 +163,8 @@
    Configuring properties for queues -

    The capacity scheduler can be configured with several properties - for each queue that control the behavior of the scheduler. This +

    The Capacity Scheduler can be configured with several properties + for each queue that control the behavior of the Scheduler. This configuration is in the conf/capacity-scheduler.xml. By default, the configuration is set up for one queue, named default.

    @@ -195,10 +172,10 @@ configuration, you should use the property name as mapred.capacity-scheduler.queue.<queue-name>.<property-name>.

    -

    For example, to define the property guaranteed-capacity +

    For example, to define the property capacity for queue named research, you should specify the property name as - mapred.capacity-scheduler.queue.research.guaranteed-capacity. + mapred.capacity-scheduler.queue.research.capacity.

    The properties defined for queues and their descriptions are @@ -206,15 +183,10 @@

    - - - - - + +
    NameDescription
    mapred.capacity-scheduler.queue.<queue-name>.guaranteed-capacityPercentage of the number of slots in the cluster that are - guaranteed to be available for jobs in this queue. - The sum of guaranteed capacities for all queues should be less - than or equal 100.
    mapred.capacity-scheduler.queue.<queue-name>.reclaim-time-limitThe amount of time, in seconds, before which resources - distributed to other queues will be reclaimed.
    mapred.capacity-scheduler.queue.<queue-name>.capacityPercentage of the number of slots in the cluster that are made + to be available for jobs in this queue. The sum of capacities + for all queues should be less than or equal 100.
    mapred.capacity-scheduler.queue.<queue-name>.supports-priority If true, priorities of jobs will be taken into account in scheduling @@ -237,27 +209,133 @@
    - Configuring the capacity scheduler -

    The capacity scheduler's behavior can be controlled through the - following properties. + Memory management + +

    The Capacity Scheduler supports scheduling of tasks on a + TaskTracker(TT) based on a job's memory requirements + and the availability of RAM and Virtual Memory (VMEM) on the TT node. + See the Hadoop + Map/Reduce tutorial for details on how the TT monitors + memory usage.

    +

    Currently the memory based scheduling is only supported + in Linux platform.

    +

    Memory-based scheduling works as follows:

    +
      +
    1. The absence of any one or more of three config parameters + or -1 being set as value of any of the parameters, + mapred.tasktracker.vmem.reserved, + mapred.task.default.maxvmem, or + mapred.task.limit.maxvmem, disables memory-based + scheduling, just as it disables memory monitoring for a TT. These + config parameters are described in the + Hadoop Map/Reduce + tutorial. The value of + mapred.tasktracker.vmem.reserved is + obtained from the TT via its heartbeat. +
    2. +
    3. If all the three mandatory parameters are set, the Scheduler + enables VMEM-based scheduling. First, the Scheduler computes the free + VMEM on the TT. This is the difference between the available VMEM on the + TT (the node's total VMEM minus the offset, both of which are sent by + the TT on each heartbeat)and the sum of VMs already allocated to + running tasks (i.e., sum of the VMEM task-limits). Next, the Scheduler + looks at the VMEM requirements for the job that's first in line to + run. If the job's VMEM requirements are less than the available VMEM on + the node, the job's task can be scheduled. If not, the Scheduler + ensures that the TT does not get a task to run (provided the job + has tasks to run). This way, the Scheduler ensures that jobs with + high memory requirements are not starved, as eventually, the TT + will have enough VMEM available. If the high-mem job does not have + any task to run, the Scheduler moves on to the next job. +
    4. +
    5. In addition to VMEM, the Capacity Scheduler can also consider + RAM on the TT node. RAM is considered the same way as VMEM. TTs report + the total RAM available on their node, and an offset. If both are + set, the Scheduler computes the available RAM on the node. Next, + the Scheduler figures out the RAM requirements of the job, if any. + As with VMEM, users can optionally specify a RAM limit for their job + (mapred.task.maxpmem, described in the Map/Reduce + tutorial). The Scheduler also maintains a limit for this value + (mapred.capacity-scheduler.task.default-pmem-percentage-in-vmem, + described below). All these three values must be set for the + Scheduler to schedule tasks based on RAM constraints. +
    6. +
    7. The Scheduler ensures that jobs cannot ask for RAM or VMEM higher + than configured limits. If this happens, the job is failed when it + is submitted. +
    8. +
    + +

    As described above, the additional scheduler-based config + parameters are as follows:

    + + + + + + + + + +
    NameDescription
    mapred.capacity-scheduler.task.default-pmem-percentage-in-vmemA percentage of the default VMEM limit for jobs + (mapred.task.default.maxvmem). This is the default + RAM task-limit associated with a task. Unless overridden by a + job's setting, this number defines the RAM task-limit.
    mapred.capacity-scheduler.task.limit.maxpmemConfiguration which provides an upper limit to maximum physical + memory which can be specified by a job. If a job requires more + physical memory than what is specified in this limit then the same + is rejected.
    +
    +
    + Job Initialization Parameters +

    Capacity scheduler lazily initializes the jobs before they are + scheduled, for reducing the memory footprint on jobtracker. + Following are the parameters, by which you can control the laziness + of the job initialization. The following parameters can be + configured in capacity-scheduler.xml

    + + + + + + - + + - - + +
    NameDescription
    + mapred.capacity-scheduler.queue.<queue-name>.maximum-initialized-jobs-per-user + + Maximum number of jobs which are allowed to be pre-initialized for + a particular user in the queue. Once a job is scheduled, i.e. + it starts running, then that job is not considered + while scheduler computes the maximum job a user is allowed to + initialize. +
    NameDescription + mapred.capacity-scheduler.init-poll-interval + + Amount of time in miliseconds which is used to poll the scheduler + job queue to look for jobs to be initialized. +
    mapred.capacity-scheduler.reclaimCapacity.intervalThe time interval, in seconds, between which the scheduler - periodically determines whether capacity needs to be reclaimed for - any queue. The default value is 5 seconds. - + mapred.capacity-scheduler.init-worker-threads + + Number of worker threads which would be used by Initialization + poller to initialize jobs in a set of queue. If number mentioned + in property is equal to number of job queues then a thread is + assigned jobs from one queue. If the number configured is lesser than + number of queues, then a thread can get jobs from more than one queue + which it initializes in a round robin fashion. If the number configured + is greater than number of queues, then number of threads spawned + would be equal to number of job queues. +
    - -
    - +
    - Reviewing the configuration of the capacity scheduler + Reviewing the configuration of the Capacity Scheduler

    Once the installation and configuration is completed, you can review it after starting the Map/Reduce cluster from the admin UI. @@ -271,7 +349,8 @@ Information column against each queue.

    - + + Modified: hadoop/core/trunk/src/docs/src/documentation/content/xdocs/cluster_setup.xml URL: http://svn.apache.org/viewvc/hadoop/core/trunk/src/docs/src/documentation/content/xdocs/cluster_setup.xml?rev=771622&r1=771621&r2=771622&view=diff ============================================================================== --- hadoop/core/trunk/src/docs/src/documentation/content/xdocs/cluster_setup.xml (original) +++ hadoop/core/trunk/src/docs/src/documentation/content/xdocs/cluster_setup.xml Tue May 5 08:42:35 2009 @@ -473,7 +473,147 @@ - +
    + Memory management +

    Users/admins can also specify the maximum virtual memory + of the launched child-task, and any sub-process it launches + recursively, using mapred.child.ulimit. Note that + the value set here is a per process limit. + The value for mapred.child.ulimit should be specified + in kilo bytes (KB). And also the value must be greater than + or equal to the -Xmx passed to JavaVM, else the VM might not start. +

    + +

    Note: mapred.child.java.opts are used only for + configuring the launched child tasks from task tracker. Configuring + the memory options for daemons is documented in + + cluster_setup.html

    + +

    The memory available to some parts of the framework is also + configurable. In map and reduce tasks, performance may be influenced + by adjusting parameters influencing the concurrency of operations and + the frequency with which data will hit disk. Monitoring the filesystem + counters for a job- particularly relative to byte counts from the map + and into the reduce- is invaluable to the tuning of these + parameters.

    +
    + +
    + Memory monitoring +

    A TaskTracker(TT) can be configured to monitor memory + usage of tasks it spawns, so that badly-behaved jobs do not bring + down a machine due to excess memory consumption. With monitoring + enabled, every task is assigned a task-limit for virtual memory (VMEM). + In addition, every node is assigned a node-limit for VMEM usage. + A TT ensures that a task is killed if it, and + its descendants, use VMEM over the task's per-task limit. It also + ensures that one or more tasks are killed if the sum total of VMEM + usage by all tasks, and their descendents, cross the node-limit.

    + +

    Users can, optionally, specify the VMEM task-limit per job. If no + such limit is provided, a default limit is used. A node-limit can be + set per node.

    +

    Currently the memory monitoring and management is only supported + in Linux platform.

    +

    To enable monitoring for a TT, the + following parameters all need to be set:

    + + + + + + + + + +
    NameTypeDescription
    mapred.tasktracker.vmem.reservedlongA number, in bytes, that represents an offset. The total VMEM on + the machine, minus this offset, is the VMEM node-limit for all + tasks, and their descendants, spawned by the TT. +
    mapred.task.default.maxvmemlongA number, in bytes, that represents the default VMEM task-limit + associated with a task. Unless overridden by a job's setting, + this number defines the VMEM task-limit. +
    mapred.task.limit.maxvmemlongA number, in bytes, that represents the upper VMEM task-limit + associated with a task. Users, when specifying a VMEM task-limit + for their tasks, should not specify a limit which exceeds this amount. +
    + +

    In addition, the following parameters can also be configured.

    + + + + + + +
    NameTypeDescription
    mapred.tasktracker.taskmemorymanager.monitoring-intervallongThe time interval, in milliseconds, between which the TT + checks for any memory violation. The default value is 5000 msec + (5 seconds). +
    + +

    Here's how the memory monitoring works for a TT.

    +
      +
    1. If one or more of the configuration parameters described + above are missing or -1 is specified , memory monitoring is + disabled for the TT. +
    2. +
    3. In addition, monitoring is disabled if + mapred.task.default.maxvmem is greater than + mapred.task.limit.maxvmem. +
    4. +
    5. If a TT receives a task whose task-limit is set by the user + to a value larger than mapred.task.limit.maxvmem, it + logs a warning but executes the task. +
    6. +
    7. Periodically, the TT checks the following: +
        +
      • If any task's current VMEM usage is greater than that task's + VMEM task-limit, the task is killed and reason for killing + the task is logged in task diagonistics . Such a task is considered + failed, i.e., the killing counts towards the task's failure count. +
      • +
      • If the sum total of VMEM used by all tasks and descendants is + greater than the node-limit, the TT kills enough tasks, in the + order of least progress made, till the overall VMEM usage falls + below the node-limt. Such killed tasks are not considered failed + and their killing does not count towards the tasks' failure counts. +
      • +
      +
    8. +
    + +

    Schedulers can choose to ease the monitoring pressure on the TT by + preventing too many tasks from running on a node and by scheduling + tasks only if the TT has enough VMEM free. In addition, Schedulers may + choose to consider the physical memory (RAM) available on the node + as well. To enable Scheduler support, TTs report their memory settings + to the JobTracker in every heartbeat. Before getting into details, + consider the following additional memory-related parameters than can be + configured to enable better scheduling:

    + + + + + +
    NameTypeDescription
    mapred.tasktracker.pmem.reservedintA number, in bytes, that represents an offset. The total + physical memory (RAM) on the machine, minus this offset, is the + recommended RAM node-limit. The RAM node-limit is a hint to a + Scheduler to scheduler only so many tasks such that the sum + total of their RAM requirements does not exceed this limit. + RAM usage is not monitored by a TT. +
    + +

    A TT reports the following memory-related numbers in every + heartbeat:

    +
      +
    • The total VMEM available on the node.
    • +
    • The value of mapred.tasktracker.vmem.reserved, + if set.
    • +
    • The total RAM available on the node.
    • +
    • The value of mapred.tasktracker.pmem.reserved, + if set.
    • +
    +
    +
    Task Controllers

    Task controllers are classes in the Hadoop Map/Reduce Modified: hadoop/core/trunk/src/docs/src/documentation/content/xdocs/mapred_tutorial.xml URL: http://svn.apache.org/viewvc/hadoop/core/trunk/src/docs/src/documentation/content/xdocs/mapred_tutorial.xml?rev=771622&r1=771621&r2=771622&view=diff ============================================================================== --- hadoop/core/trunk/src/docs/src/documentation/content/xdocs/mapred_tutorial.xml (original) +++ hadoop/core/trunk/src/docs/src/documentation/content/xdocs/mapred_tutorial.xml Tue May 5 08:42:35 2009 @@ -1105,8 +1105,26 @@ counters for a job- particularly relative to byte counts from the map and into the reduce- is invaluable to the tuning of these parameters.

    + +

    Users can choose to override default limits of Virtual Memory and RAM + enforced by the task tracker, if memory management is enabled. + Users can set the following parameter per job:

    + + + + + + + +
    NameTypeDescription
    mapred.task.maxvmemintA number, in bytes, that represents the maximum Virtual Memory + task-limit for each task of the job. A task will be killed if + it consumes more Virtual Memory than this number. +
    mapred.task.maxpmemintA number, in bytes, that represents the maximum RAM task-limit + for each task of the job. This number can be optionally used by + Schedulers to prevent over-scheduling of tasks on a node based + on RAM needs. +
    -
    Map Parameters