hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Szilard Nemeth (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4599) Set OOM control for memory cgroups
Date Tue, 15 May 2018 07:57:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16475453#comment-16475453
] 

Szilard Nemeth commented on YARN-4599:
--------------------------------------

Thanks [~miklos.szegedi@cloudera.com] for the patch.

Couple of comments, some of them are minor, some of them could be important:

 

1. I think you have a typo for the property: yarn.nodemanager.elastic-memory-control.enabled
in yarn-default.xml 
"if all the containers exceeded the a limit." --> should be changed to: "...exceeded the
limit." (without an "a")

2. I think time unit should be provided for the property yarn.nodemanager.elastic-memory-control.timeout
in yarn-default.xml.
>From the description, I guess the value 5 means seconds from the code, but only from the
description it's ambiguous


3. {{CGroupElasticMemoryController}}:

A.) In constructor: I would create a separate method for create and return the OOM handler
instance as the constructor is very crowded.
As I see it, the {{oomHandlerClass}} and the instantiation logic both could be extracted
to the method and then you don't need the {{oomHandlerTemp}} variable at all.

B.) In constructor: 
I think it's cleaner to use YarnConfiguration constants in string messages (log messages,
exception message).
E.g. {{LOG.warn("yarn.nodemanager.elastic-memory-control.enabled}}... --> You could use
{{YarnConfiguration.NM_ELASTIC_MEMORY_CONTROL_ENABLED}} instead.
If you search for "yarn.nodemanager." in this class it will give you all the results I'm talking
about.

C.) [Question] In constructor: If both {{controlPhysicalMemory}} and {{controlVirtualMemory}} is
on, you only warn log a line.
How the NM will work in these conditions, in other words, is it enough to just log a warning
and continue NM's execution?

D.) In {{getOOMListenerExecutablePath()}}: You print OOM listener's path on debug level with:

{code:java}
conf.get(
 YarnConfiguration.NM_ELASTIC_MEMORY_CONTROL_OOM_LISTENER_PATH,
 defaultPath){code}

I think you should extract this to a final local variable as you are returning the same value.

E.) In run(): On the first line of this method, you have a typo in {{oomNotResoved}} -->
should be {{oomNotResolved}}.

F.) In run(): You have a TODO in this method. Do you have any plans to resolve it or it will
be resolved later?

G.) In run(): I'm curious whether it can happen that in the while loop's statement, {{events.read}} would
read more data than 8 bytes or is it perfectly safe to rely on that on every read, at most
8 bytes will be read?

H.) In run(): Since can harldy fit on one screen, I would extract the substantive part of
the run() method to a new method, i.e. from the first {{executor.submit()}} call (errorListener)
to the last statement until the catch.

I.) In run(): {{throw new YarnRuntimeException("OOM was not resolved in time.");}}} -->
I would include how much time was spent in the log message for better troubleshooting.

J.) Javadoc of {{watchAndLogOOMState}} return is wrong, it returns a boolean and not an empty
string.

K.) In {{setCGroupParameters()}}: Maybe in the {{ResourceHandlerException}}'s message would
be more understandable with whole words, i.e. instead of p/v abbreviations, I would use physical/virtual.

L.) In {{watchAndLogOOMState()}}: Please extract {{"under_oom 1"}} to a static final String.
This string is also used in {{DefaultOOMHandler}} so you should use the same constant.


4. {{DefaultOOMHandler}}

A.) There are several TODOs in this class, what is the plan with those?
A.) Constructor could be package-private
B.) (minor) Javadoc of run() has a typo, should be: "...exceeded its limits" instead of "it's
limits". There are several other typos like this in this class.
B.) In run(): Call to

 
{code:java}
ArrayList<Container> containers = new ArrayList<>();
 containers.addAll(context.getContainers().values());
{code}
could be this one liner: 
{code:java}
ArrayList<Container> containers = new ArrayList<>(context.getContainers().values());{code}
C.) In {{killContainerIfOOM()}}: 
This line specifically:
{code:java}
long request = container.getResource().getMemorySize() * 1024 * 1024;{code}

According to javadoc of {{org.apache.hadoop.yarn.api.records.Resource#getMemorySize}}, this
will return MB of memory size as a default and it depends on the unit of the resource.
You multiply it by 1024 two times, so I guess you are relying on that memory size would be
bytes instead. 
*This one seems a possible unit conversion issue for me.*

5.) {{TestCGroupElasticMemoryController}}:

A.) {{testConstructorHandler()}}: It's not clear for me why do you expect {{YarnException}} with
this test.
Moreover, {{YarnException}} could be thrown if the {{CGroupMemoryController}}'s constructor
could not create an instance of the OOM handler which could hide a code error. I would throw
a special exception from {{DummyRunnable}}'s constructor and assert on that exception from
the testcase in order to be sure that the OOM handler's constructor is invoked without an
issue.

> Set OOM control for memory cgroups
> ----------------------------------
>
>                 Key: YARN-4599
>                 URL: https://issues.apache.org/jira/browse/YARN-4599
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.9.0
>            Reporter: Karthik Kambatla
>            Assignee: Miklos Szegedi
>            Priority: Major
>              Labels: oct16-medium
>         Attachments: Elastic Memory Control in YARN.pdf, YARN-4599.000.patch, YARN-4599.001.patch,
YARN-4599.002.patch, YARN-4599.003.patch, YARN-4599.004.patch, YARN-4599.005.patch, YARN-4599.006.patch,
YARN-4599.sandflee.patch, yarn-4599-not-so-useful.patch
>
>
> YARN-1856 adds memory cgroups enforcing support. We should also explicitly set OOM control
so that containers are not killed as soon as they go over their usage. Today, one could set
the swappiness to control this, but clusters with swap turned off exist.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org


Mime
View raw message