hadoop-mapreduce-issues mailing list archives

From "Andrew Ferguson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4334) Add support for CPU isolation/monitoring of containers
Date Mon, 16 Jul 2012 17:58:35 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13415452#comment-13415452 ]

Andrew Ferguson commented on MAPREDUCE-4334:

Hi Bikas, thanks for thinking about this! Comments inline:

bq. Somewhere in this thread it was mentioned controlling memory via OS. In my experience
this is not an optimal choice because
bq. 1) makes it hard to debug task failures due to memory issues. Abrupt OS termination or
denial of more memory resulting in NPE/bad pointers etc. It's better to just monitor the memory
and then enforce limits with clear error message saying - task was terminated because it used
more memory than allotted.

On Linux, enforcing memory limits via Cgroups feels a bit like simply running a process on
a machine with less memory installed. When the memory allocation is pushing the threshold,
the Linux OOM killer destroys the task. The patch above detects that the process has been
killed and logs an error message indicating that the task was killed for consuming too many
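
As a rough illustration of the detection step, here is a minimal sketch that turns a child's exit status into a readable log message. It is not the code from the patch: `describe_exit` and the message wording are hypothetical, it assumes Linux, and a SIGKILL can come from sources other than the OOM killer, so the message only names the memory limit as a likely cause.

```python
import signal


def describe_exit(returncode, limit_mb):
    """Turn a subprocess-style return code into a log message.

    A negative return code from Python's subprocess module means the
    child was terminated by that signal number. The Linux OOM killer
    terminates its victims with SIGKILL.
    """
    if returncode == 0:
        return "task completed normally"
    if returncode == -signal.SIGKILL:
        return ("task was killed by SIGKILL; it likely exceeded its "
                "memory limit of %d MB" % limit_mb)
    if returncode < 0:
        return "task was terminated by signal %d" % -returncode
    return "task exited with status %d" % returncode


# Hypothetical usage: a container limited to 1024 MB died from SIGKILL.
message = describe_exit(-signal.SIGKILL, 1024)
```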

bq. 2) due to different scenarios, tasks may have memory spikes or temporary increases. The
OS will enforce tight limits but NodeManager monitoring can be more flexible and not terminate
a task because it shot to 2.1GB instead of staying under 2.

I would argue that the strict enforcement of Cgroups is exactly the behavior we want because
it provides isolation. If two containers are running on a node with 4 GB of RAM, and each
is using 2 GB, and one happens to spike to 3 GB momentarily, the spiking container should
suffer -- if we continue monitoring the memory as done today, then the well-behaved container
might suffer by being swapped-out to make room for the spiking container.

I believe the spiking concern is mitigated by the fact that Cgroups allows you to set both
a physical memory limit, and a virtual memory limit (which my patch above makes use of). For
example, I set the physical memory limit to, say, 1 GB of RAM, and the virtual memory limit
to 2.1 GB. When a process momentarily spikes above its 1 GB of RAM, it will be allocated
memory from swap without a problem. This is configurable by the already extant "yarn.nodemanager.vmem-pmem-ratio"
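
The arithmetic behind the two limits can be sketched as follows. This is an illustration under stated assumptions, not the patch itself: `cgroup_memory_limits` is a hypothetical helper, and treating the virtual limit as the cgroup v1 `memory.memsw.limit_in_bytes` file (RAM plus swap) is my reading of the description above. The function only computes values; it does not write to the cgroup filesystem.

```python
GIB = 1024 ** 3


def cgroup_memory_limits(pmem_bytes, vmem_pmem_ratio):
    """Return cgroup v1 memory settings for one container.

    memory.limit_in_bytes caps physical RAM. memory.memsw.limit_in_bytes
    caps RAM plus swap, which stands in for the virtual limit here
    (an assumption of this sketch).
    """
    if pmem_bytes <= 0 or vmem_pmem_ratio < 1:
        raise ValueError("need a positive limit and a ratio of at least 1")
    return {
        "memory.limit_in_bytes": pmem_bytes,
        "memory.memsw.limit_in_bytes": int(pmem_bytes * vmem_pmem_ratio),
    }


# The example from the text: 1 GB physical, ratio 2.1.
limits = cgroup_memory_limits(1 * GIB, 2.1)
```

With these values a container that briefly needs more than 1 GB is pushed into swap rather than killed, and is only killed once RAM plus swap reaches roughly 2.1 GB.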

bq. Disk scheduling and monitoring would be a hard to achieve goal with multiple writers to
disk spinning things their own way and expecting something that will likely not happen.

Sure, it is tricky, and the feasibility depends on the semantics YARN promises applications.
However, the Linux Completely Fair Queuing I/O scheduler has semantics which are quite similar
to the semantics I'm proposing we promise for CPUs (proportional weights). The blkio Cgroup
subsystem already provides both proportional sharing and throttling: http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/ch-Subsystems_and_Tunable_Parameters.html#sec-blkio
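
To make "proportional weights" concrete, here is a small sketch of how a contended resource would be divided. The function name, container names, and weight values are made up for illustration; the real schedulers apply this idea inside the kernel, and the weights only matter while the containers are actually competing.

```python
def proportional_shares(weights, capacity=1.0):
    """Split capacity among busy containers in proportion to weight."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one positive weight is required")
    return {name: capacity * w / total for name, w in weights.items()}


# Two busy containers with a 2:1 weight ratio share one resource.
shares = proportional_shares({"container_a": 1024, "container_b": 512})
```

Here container_a receives two thirds of the resource and container_b one third. If container_b goes idle, calling the function with only the busy container gives it the full capacity, which is the work-conserving behavior proportional sharing is meant to provide.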

bq. Network scheduling and monitoring shares choke points at multiple levels beyond the machines
and trying to optimally and proportionally use the network tends to be a problem that's better
served globally.

YARN is a global scheduler. Linux traffic controls [1], in combination with the network controller
for Cgroups, can be used to implement the results of Seawall [2], FairCloud [3], and similar
projects. There are many datacenter designs these days; some will be a perfect match for end-host-only
bandwidth control, and others an imperfect match. While end-host-only bandwidth control is
not a magic bullet, I strongly believe that it is both useful enough, and easy enough to implement,
to warrant pursuit.

bq. My 2 cents would be to limit this to just CPU for now.

It is. However, I believe the patch above is easily extensible to other resources (you can
see for yourself that there is a small difference between the memory-only patch, and the memory+cpu

bq. Based on the comments above, I would agree that we need to make sure platform specific
stuff should not leak into the code so that other platforms (imminently Windows) can support
this stuff.

Totally agree. That's why I proposed making it pluggable with MAPREDUCE-4351.

bq. An alternative to pluggable ContainersMonitor would be to make CPU management a pluggable
component of ContainersManager. My POV is that ContainersManager manages the resources of
containers and has logic that will be common across platforms. The tools it uses will change.
Eg. ProcfsBaseProcessTree is the tool used to monitor and manage memory. I can see that being
changed to a MemoryMonitor interface with platform specific implementations. That's what's happening
on the Windows port in branch 1. I can see a CPUMonitor interface for CPU. Or maybe a ResourceMonitor
that has methods for both memory and CPU.

I'm afraid I'm a bit confused by your suggestion here -- ContainersMonitor is already a part
of the ContainersManager. Are you proposing that we create a pluggable interface for each
type of resource independently? Perhaps you can point me to the code & branch which has
the suggestion you are describing? There are two pieces to resource management: monitoring
& enforcement, and both are platform-specific. Because multiple Linux enforcement solutions
(the current Java-native, the above Cgroups, and the planned taskset) can all use the same
Linux-specific monitoring code, it seems reasonable to keep the two features separate. The
monitoring code is already pluggable (ResourceCalculatorPlugin).
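
The separation I have in mind can be sketched like this. Every class name below is hypothetical and none of it is Hadoop code; it only shows two enforcement strategies sharing a single monitoring implementation.

```python
from abc import ABC, abstractmethod


class UsageMonitor(ABC):
    """Reports resource usage; the implementation is platform-specific."""

    @abstractmethod
    def usage_bytes(self, container_id):
        ...


class Enforcer(ABC):
    """Decides what to do with a container, given a shared monitor."""

    def __init__(self, monitor):
        self.monitor = monitor

    @abstractmethod
    def check(self, container_id, limit_bytes):
        ...


class FakeMonitor(UsageMonitor):
    """Test double standing in for a real per-platform monitor."""

    def __init__(self, readings):
        self.readings = readings

    def usage_bytes(self, container_id):
        return self.readings[container_id]


class PollingEnforcer(Enforcer):
    """User-space enforcement: compare the reading to the limit."""

    def check(self, container_id, limit_bytes):
        used = self.monitor.usage_bytes(container_id)
        return "kill" if used > limit_bytes else "ok"


class KernelDelegatingEnforcer(Enforcer):
    """The kernel applies the limit; the monitor is only for reporting."""

    def check(self, container_id, limit_bytes):
        self.monitor.usage_bytes(container_id)  # read for reporting only
        return "ok"


monitor = FakeMonitor({"c1": 3 * 1024 ** 3})
polling = PollingEnforcer(monitor).check("c1", 2 * 1024 ** 3)
delegating = KernelDelegatingEnforcer(monitor).check("c1", 2 * 1024 ** 3)
```

Both enforcers take the same monitor object, so adding a new enforcement mechanism does not require touching the monitoring code.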


[1] http://lartc.org/howto/ and 'man tc'
[2] http://research.microsoft.com/en-us/UM/people/srikanth/data/nsdi11_seawall.pdf
[3] http://www.hpl.hp.com/people/lucian_popa/faircloud_hotnets.pdf
> Add support for CPU isolation/monitoring of containers
> ------------------------------------------------------
>                 Key: MAPREDUCE-4334
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4334
>             Project: Hadoop Map/Reduce
>          Issue Type: Sub-task
>            Reporter: Arun C Murthy
>            Assignee: Andrew Ferguson
>         Attachments: MAPREDUCE-4334-pre1.patch, MAPREDUCE-4334-pre2-with_cpu.patch, MAPREDUCE-4334-pre2.patch,
MAPREDUCE-4334-pre3-with_cpu.patch, MAPREDUCE-4334-pre3.patch
> Once we get in MAPREDUCE-4327, it will be important to actually enforce limits on CPU
consumption of containers. 
> Several options spring to mind:
> # taskset (RHEL5+)
> # cgroups (RHEL6+)


