hadoop-common-dev mailing list archives

From "Vivek Ratan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4035) Modify the capacity scheduler (HADOOP-3445) to schedule tasks based on memory requirements and task trackers free memory
Date Fri, 31 Oct 2008 11:57:45 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644267#action_12644267 ]

Vivek Ratan commented on HADOOP-4035:
-------------------------------------

Since the issue of dealing with memory-intensive and badly behaved jobs has spanned more than
one Jira, here's the latest summary of the overall proposal (following some offline discussions).


The problem, as stated originally in HADOOP-3581, is that certain badly-behaved jobs end up
using too much memory on a node and can bring down that node. We need to prevent this. A related
requirement, as described in HADOOP-3759, is that the system should respect the different, and legitimate, memory requirements of different jobs.

There are two independent parts to solving this problem: monitoring and scheduling. Let's
look at monitoring first. 

Monitoring
----------------

We want to ensure that the sum total of virtual memory (VM) usage by all tasks does not go
over a limit (call this the _max-VM-per-node_ limit). That's really what brings down a machine.
To detect badly behaved jobs, we want to associate a limit with each task (call this the _max-VM-per-task_
limit) such that a task is considered badly behaved if its VM usage goes over this limit.
Think of the _max-VM-per-task_ limit as a kill limit. A TT monitors each task for its memory
usage (this includes the memory used by the task's descendants). If a task's memory usage
goes over its  _max-VM-per-task_ limit, that task is killed. This monitoring has been implemented
in HADOOP-3581. In addition, a TT monitors the total memory usage of all tasks spawned by
the TT. If this value goes over the _max-VM-per-node_ limit, the TT needs to kill one or more
tasks. As a simple solution, the TT can kill one or more tasks that started most recently.
This approach has been suggested in HADOOP-4523. Tasks that are killed because they went over
their memory limit should be treated as failed, since they violated their contract. Tasks
that are killed because the sum total of memory usage was over a limit should be treated as
killed, since it's not really their fault. 
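To make the monitoring flow concrete, here's a rough sketch of the two checks a TT would perform each monitoring cycle. This is illustrative only; the class and field names below are made up and do not correspond to the actual TaskTracker code.

{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only; these classes and fields are made up and do not
// correspond to the actual TaskTracker code.
class TaskMemoryInfo {
  String taskId;
  long vmUsedMB;        // current VM usage of the task plus its descendants
  long maxVmPerTaskMB;  // the task's kill limit
  long startTime;       // used to pick the most recently started tasks
}

class MemoryMonitorSketch {
  long maxVmPerNodeMB;  // total VM on the node minus the reserved offset

  // Fills failTasks with tasks over their own limit (treated as FAILED) and
  // killTasks with victims chosen because the node total is over its limit
  // (treated as KILLED).
  void runOneCycle(List<TaskMemoryInfo> tasks,
                   List<String> failTasks, List<String> killTasks) {
    long totalVmMB = 0;
    for (TaskMemoryInfo t : tasks) {
      if (t.vmUsedMB > t.maxVmPerTaskMB) {
        failTasks.add(t.taskId);   // badly behaved: violated its contract
      } else {
        totalVmMB += t.vmUsedMB;
      }
    }
    // If the remaining tasks together still exceed the node limit, kill the
    // most recently started ones until the total is back under the limit.
    if (totalVmMB > maxVmPerNodeMB) {
      List<TaskMemoryInfo> candidates = new ArrayList<TaskMemoryInfo>(tasks);
      candidates.removeIf(t -> failTasks.contains(t.taskId));
      candidates.sort(
          Comparator.comparingLong((TaskMemoryInfo t) -> t.startTime).reversed());
      for (TaskMemoryInfo t : candidates) {
        if (totalVmMB <= maxVmPerNodeMB) break;
        killTasks.add(t.taskId);   // not its fault: treated as killed
        totalVmMB -= t.vmUsedMB;
      }
    }
  }
}
{code}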

How do we specify these limits? 
* *for _max-VM-per-node_*: HADOOP-3581 provides a config option, _mapred.tasktracker.tasks.maxmemory_
, which acts as the _max-VM-per-node_ limit. As per discussions in this Jira, and in HADOOP-4523,
this needs to be enhanced.  _mapred.tasktracker.tasks.maxmemory_ should be replaced by _mapred.tasktracker.virtualmemory.reserved_,
which indicates an offset (in MB?). _max-VM-per-node_ is then the total VM on the machine,
minus this offset. How do we get the total VM on the machine? This can be done by the plugin
interface that Owen proposed earlier. 
* *for _max-VM-per-task_*: HADOOP-3759 and HADOOP-4439 define a cluster-wide configuration,
_mapred.task.default.maxmemory_, that describes the default maximum VM associated per task.
Rename it to _mapred.task.default.maxvm_ for consistency. This is the default _max-VM-per-task_
limit associated with a task. To support jobs that need higher or lower limits, this value
can be overridden by individual jobs. A job can set a config value, _mapred.task.maxvm_, which
overrides _mapred.task.default.maxvm_ for all tasks for that job. 
* Furthermore, as described earlier in this Jira, we want to prevent users from setting _mapred.task.maxvm_
to an arbitrarily high number and thus gaming the system. To do this, there should be a cluster-wide
setting, _mapred.task.limit.maxvm_, that limits the value of _mapred.task.maxvm_. If _mapred.task.maxvm_
is set to a value higher than _mapred.task.limit.maxvm_, the job should not run. Either this
check can be done in the JT when a job is submitted, or a scheduler can fail the job if it
detects this situation. 

Note that the monitoring process can be disabled if _mapred.tasktracker.virtualmemory.reserved_
is not present, or has some default negative value. 
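As an illustration of how these config values could fit together, here's a minimal sketch. The method names are made up, and the total VM on the node is assumed to be obtained elsewhere (e.g. via the plugin interface mentioned above); only the config keys come from the proposal.

{code:java}
import org.apache.hadoop.conf.Configuration;

// Sketch only; method names are made up. Only the config keys come from
// the proposal above.
class VmLimitsSketch {

  // max-VM-per-node = total VM on the machine minus the reserved offset.
  // A missing or negative offset disables monitoring.
  static long getMaxVmPerNode(Configuration conf, long totalVmOnNodeMB) {
    long reservedMB = conf.getLong("mapred.tasktracker.virtualmemory.reserved", -1);
    if (reservedMB < 0) {
      return -1;  // monitoring disabled
    }
    return totalVmOnNodeMB - reservedMB;
  }

  // max-VM-per-task = the job's override if set, else the cluster-wide default.
  static long getMaxVmPerTask(Configuration jobConf) {
    long defaultMB = jobConf.getLong("mapred.task.default.maxvm", -1);
    return jobConf.getLong("mapred.task.maxvm", defaultMB);
  }

  // Reject jobs that ask for more than the cluster-wide cap.
  static void validateJobVmLimit(Configuration jobConf) {
    long requestedMB = jobConf.getLong("mapred.task.maxvm", -1);
    long capMB = jobConf.getLong("mapred.task.limit.maxvm", -1);
    if (capMB > 0 && requestedMB > capMB) {
      throw new IllegalArgumentException(
          "mapred.task.maxvm exceeds mapred.task.limit.maxvm; the job should not run");
    }
  }
}
{code}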


Scheduling
-----------------

In order to prevent tasks from using too much memory, a scheduler can limit the number of tasks running on a node based on how much free memory is available and how much a task needs. The Capacity Scheduler will do this, though we cannot require all schedulers to support this feature. As per HADOOP-3759, TTs report, in each heartbeat, how much free
VM they have (which is equal to _max-VM-per-node_ minus the sum of _max-VM-per-task_ for each
running task). The Capacity Scheduler needs to ensure that: 
# there is enough VM for a new task to run. It does this by comparing the task's requirement (its _max-VM-per-task_ limit) to the free VM available on the TT.
# there is enough RAM available for a task so that there is not a lot of page swapping and
thrashing when tasks run. This is much harder to figure out and it's not even clear what it
means to have 'enough RAM available' for a task. A simple proposal, to get us started, is
to assume a fraction of the _max-VM-per-task_ limit as the 'RAM limit' for a task. Call this
the _max-RAM-per-task_ limit, and think of it as a scheduling limit. For a task to be scheduled,
its _max-RAM-per-task_ limit should be less than the total RAM on a TT minus the sum of _max-RAM-per-task_
limits of tasks running on the TT. This also implies that a TT should report its free RAM (the total RAM on the node minus the sum of the _max-RAM-per-task_ limits for each running task); a sketch of this bookkeeping follows the list.
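Here's a sketch of the bookkeeping described in the two points above. The classes and fields are illustrative, not actual Hadoop classes.

{code:java}
import java.util.List;

// Sketch of the free-memory figures a TT would report in its heartbeat.
// These classes and fields are illustrative, not actual Hadoop classes.
class RunningTask {
  long maxVmPerTaskMB;   // the task's VM kill limit
  long maxRamPerTaskMB;  // the task's RAM scheduling limit
}

class FreeMemoryReportSketch {

  // free VM = max-VM-per-node minus the sum of max-VM-per-task of running tasks
  static long freeVm(long maxVmPerNodeMB, List<RunningTask> running) {
    long committed = 0;
    for (RunningTask t : running) {
      committed += t.maxVmPerTaskMB;
    }
    return maxVmPerNodeMB - committed;
  }

  // free RAM = RAM available to TT tasks minus the sum of max-RAM-per-task
  // of running tasks
  static long freeRam(long ramAvailableToTasksMB, List<RunningTask> running) {
    long committed = 0;
    for (RunningTask t : running) {
      committed += t.maxRamPerTaskMB;
    }
    return ramAvailableToTasksMB - committed;
  }
}
{code}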

Just as with the handling of VM, we may want to use a part of the RAM for scheduling TT tasks,
and not all of it. If so, we can introduce a config value, _mapred.tasktracker.ram.reserved_,
which indicates an offset (in MB?). The amount of RAM available to the TT tasks is then the
total RAM on the machine, minus this offset. How do we get the total RAM on the machine? By
the same plugin interface through which we obtain total VM. 

How do we specify a task's _max-RAM-per-task_ limit? There is a system-wide default value,
_mapred.capacity-scheduler.default.ramlimit_, expressed as a percentage. A task's default
_max-RAM-per-task_ limit is equal to the task's _max-VM-per-task_ limit times this value.
We may start by setting _mapred.capacity-scheduler.default.ramlimit_ to 50% or 33%. In order
to let individual jobs override this default, a job can set a config value, _mapred.task.maxram_,
expressed in MB, which then becomes the task's _max-RAM-per-task_ limit. Furthermore, as with
VM settings, we want to prevent users from setting _mapred.task.maxram_ to an arbitrarily
high number and thus gaming the system. To do this, there should be a cluster-wide setting,
_mapred.task.limit.maxram_, that limits the value of _mapred.task.maxram_. If _mapred.task.maxram_
is set to a value higher than _mapred.task.limit.maxram_, the job should not run. Either this
check can be done in the JT when a job is submitted, or a scheduler can fail the job if it
detects this situation. 
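For example, with _mapred.capacity-scheduler.default.ramlimit_ set to 50%, a task with a _max-VM-per-task_ limit of 2048 MB would get a default _max-RAM-per-task_ limit of 1024 MB unless its job sets _mapred.task.maxram_. Here's a sketch of that resolution; the method names are made up, and only the config keys come from the proposal above.

{code:java}
import org.apache.hadoop.conf.Configuration;

// Sketch of how a task's max-RAM-per-task limit could be derived.
// Method names are made up; only the config keys come from the proposal.
class RamLimitsSketch {

  static long getMaxRamPerTask(Configuration conf, long maxVmPerTaskMB) {
    // Job-level override, in MB.
    long overrideMB = conf.getLong("mapred.task.maxram", -1);
    if (overrideMB > 0) {
      return overrideMB;
    }
    // Otherwise a fraction of the task's VM limit; 50% is the suggested
    // starting value from the proposal.
    int ramLimitPercent = conf.getInt("mapred.capacity-scheduler.default.ramlimit", 50);
    return (maxVmPerTaskMB * ramLimitPercent) / 100;
  }

  // Reject jobs whose RAM request exceeds the cluster-wide cap.
  static void validateJobRamLimit(Configuration conf) {
    long requestedMB = conf.getLong("mapred.task.maxram", -1);
    long capMB = conf.getLong("mapred.task.limit.maxram", -1);
    if (capMB > 0 && requestedMB > capMB) {
      throw new IllegalArgumentException(
          "mapred.task.maxram exceeds mapred.task.limit.maxram; the job should not run");
    }
  }
}
{code}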

The Capacity Scheduler, when it picks a task to run, will check if both the task's RAM limit
and VM limit can be satisfied. If so, the task is given to the TT. If not, nothing is given
to the TT (i.e., the cluster blocks till at least one TT has enough memory). We will not block
forever because we limit what the task can ask for, and these limits should be set lower than
the RAM and VM on each TT. In order to tax users on their jobs' requirements, we may charge them based on the value they set per task, but for now, there is no penalty associated with the value a user sets for _mapred.task.maxvm_ for a job.
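Putting the two checks together, the per-TT decision might look roughly like this (a sketch only; the class and method names are placeholders, not the Capacity Scheduler's actual code):

{code:java}
// Sketch of the decision described above: hand a task to a TT only if the
// TT's reported free VM and free RAM can both accommodate the task's limits.
// Names are placeholders, not the Capacity Scheduler's actual code.
class SchedulingCheckSketch {

  static boolean canAssign(long taskMaxVmMB, long taskMaxRamMB,
                           long ttFreeVmMB, long ttFreeRamMB) {
    boolean enoughVm = taskMaxVmMB <= ttFreeVmMB;
    boolean enoughRam = taskMaxRamMB <= ttFreeRamMB;
    // If either check fails, give the TT nothing; the task waits until some
    // TT frees enough memory. Because per-task limits are capped below the
    // RAM and VM on each TT, the wait cannot last forever.
    return enoughVm && enoughRam;
  }
}
{code}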


Open issues
-------------------

Based on the writeup above, I'm summarizing a few of the open issues (mostly minor): 
# Should the memory-related config values be expressed in MB or GB or KB or just bytes? MB
sounds good to me. 
# If a job's specified VM or RAM task limit is higher than the max limit, that job shouldn't
be allowed to run. Should the JT reject the job when it is submitted, or should the scheduler
do it, by failing the job? The argument for the former is that these limits apply to all schedulers,
but then again, they are scheduling-based limits, so maybe they should be enforced in each of the schedulers. In the latter case, if a scheduler does not support scheduling based on
memory limits, it can just ignore these settings and run the job. So the latter option seems
better. 
# Should the Capacity Scheduler use the entire RAM of a TT when making a scheduling decision,
or an offset? Given that the RAM fractions are not very precise (they're based on fractions
of the VM), an offset doesn't make much of a difference (you could tweak _mapred.capacity-scheduler.default.ramlimit_
to achieve what the offset would), and adds an extra config value. At the same time, part
of the RAM is blocked for non-Hadoop stuff, and an offset does make things symmetrical. 


> Modify the capacity scheduler (HADOOP-3445) to schedule tasks based on memory requirements
and task trackers free memory
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4035
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4035
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>    Affects Versions: 0.19.0
>            Reporter: Hemanth Yamijala
>            Assignee: Vinod K V
>            Priority: Blocker
>             Fix For: 0.20.0
>
>         Attachments: 4035.1.patch, HADOOP-4035-20080918.1.txt, HADOOP-4035-20081006.1.txt,
HADOOP-4035-20081006.txt, HADOOP-4035-20081008.txt
>
>
> HADOOP-3759 introduced configuration variables that can be used to specify memory requirements
for jobs, and also modified the tasktrackers to report their free memory. The capacity scheduler
in HADOOP-3445 should schedule tasks based on these parameters. A task that is scheduled on
a TT that uses more than the default amount of memory per slot can be viewed as effectively
using more than one slot, as it would decrease the amount of free memory on the TT by more
than the default amount while it runs. The scheduler should make the used capacity account
for this additional usage while enforcing limits, etc.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

