hadoop-yarn-issues mailing list archives

From "Steven Rand (JIRA)" <j...@apache.org>
Subject [jira] [Created] (YARN-8903) when NM becomes unhealthy due to local disk usage, have option to kill application using most space instead of releasing all containers on node
Date Thu, 18 Oct 2018 04:44:00 GMT
Steven Rand created YARN-8903:
---------------------------------

             Summary: when NM becomes unhealthy due to local disk usage, have option to kill
application using most space instead of releasing all containers on node
                 Key: YARN-8903
                 URL: https://issues.apache.org/jira/browse/YARN-8903
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: nodemanager
    Affects Versions: 3.1.1
            Reporter: Steven Rand


We sometimes experience an issue in which a single application, usually a Spark job, causes
at least one node in a YARN cluster to become unhealthy by filling the local dir(s) on
that node past the disk-utilization threshold at which the node is marked unhealthy.

When this happens, the impact is potentially large depending on what else is running on that
node, as all containers on that node are lost. Sometimes not much else is running on the node
and it's fine, but other times we lose AM containers from other apps and/or non-AM containers
with long-running tasks.

I think it would be helpful to add an option (default false) whereby if a node is about
to become unhealthy due to full local disk(s), the NodeManager instead identifies the
application using the most local disk space on that node and kills that application.
(Roughly analogous to how the OOM killer in Linux picks one process to kill rather than
letting the machine crash.)
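To make the idea concrete, here is a rough sketch of the victim-selection step in Java. All names here are hypothetical and do not exist in YARN; a real implementation would hook into the NodeManager's disk health checker and use its accounting of per-application local usage:

```java
import java.util.Map;

// Hypothetical sketch only: illustrates picking the application that uses the
// most local disk space on this node, analogous to the Linux OOM killer
// picking a single process. Class and method names are made up for this JIRA.
public class DiskUsageVictimPicker {

    /**
     * Given per-application local disk usage on this node (in bytes), return
     * the ID of the application using the most space, or null if the map is
     * empty. This is the application the NM would ask the RM to kill.
     */
    public static String pickVictim(Map<String, Long> usageByApp) {
        String victim = null;
        long maxUsage = -1L;
        for (Map.Entry<String, Long> e : usageByApp.entrySet()) {
            if (e.getValue() > maxUsage) {
                maxUsage = e.getValue();
                victim = e.getKey();
            }
        }
        return victim;
    }
}
```

The real work would of course be in computing `usageByApp` (summing usercache and filecache dirs per application) and in wiring the kill through the RM, but the selection itself is this simple.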

The benefit is that only one application is impacted, and no other application loses any containers.
This prevents one user's poorly written code that shuffles/spills huge amounts of data from
negatively impacting other users.

The downside is that we're killing the entire application, not just the task(s) responsible
for the local disk usage. I believe it's necessary to kill the whole application rather than
the container running the relevant task(s), because identifying that container would require
more knowledge of the internal state of the aux services responsible for shuffling than
YARN has, as far as I understand.

If this seems reasonable, I can work on the implementation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

