hadoop-yarn-issues mailing list archives

From "Oleksandr Kalinin (JIRA)" <j...@apache.org>
Subject [jira] [Created] (YARN-5140) NM usercache fill up with burst of jobs leads to NM outage
Date Tue, 24 May 2016 21:57:13 GMT
Oleksandr Kalinin created YARN-5140:
---------------------------------------

             Summary: NM usercache fill up with burst of jobs leads to NM outage
                 Key: YARN-5140
                 URL: https://issues.apache.org/jira/browse/YARN-5140
             Project: Hadoop YARN
          Issue Type: Bug
          Components: nodemanager
    Affects Versions: 2.7.0
         Environment: Linux RHEL 6.7, Hadoop 2.7.0


            Reporter: Oleksandr Kalinin


A burst or high rate of submitted jobs with a substantial NM usercache resource localization
footprint may rapidly fill up the NM local temporary IO FS (/tmp by default), with
negative consequences for stability.

The core issue seems to be that the NM continues to localize resources beyond the
maximum local cache size (yarn.nodemanager.localizer.cache.target-size-mb, default 10G).
The maximum local cache size is effectively not taken into account when localizing new resources;
it appears to be enforced only by the periodic cache cleanup (default interval is 10 min, controlled by
yarn.nodemanager.localizer.cache.cleanup.interval-ms).
This leads to a kind of self-destruction scenario: once /tmp FS utilization reaches
the threshold of 90%, the NM will automatically de-register from the RM, effectively leading to an NM
outage.
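
To make the timing argument concrete, here is a minimal sketch of the scenario as a
toy model. It is not NM code: the function name, its parameters and the one-minute
time step are all hypothetical, and it optimistically assumes that each cleanup pass
can evict the cache all the way down to the target size. The 10 min interval, 10G
target and 90% threshold are the values quoted above.

```python
def first_unhealthy_minute(disk_gb, other_used_gb, burst_gb_per_min,
                           burst_minutes, cleanup_interval_min=10,
                           target_size_gb=10.0, threshold_pct=90.0):
    """Toy model (hypothetical, not NodeManager code).

    Returns the first minute at which disk utilization reaches the
    threshold, or None if it never does during the burst.
    """
    cache_gb = 0.0
    for minute in range(1, burst_minutes + 1):
        # Localization proceeds without checking the cache target size.
        cache_gb += burst_gb_per_min
        # The target size is applied only when the periodic cleanup runs.
        # Assumption: everything above the target is evictable.
        if minute % cleanup_interval_min == 0:
            cache_gb = min(cache_gb, target_size_gb)
        utilization_pct = 100.0 * (other_used_gb + cache_gb) / disk_gb
        if utilization_pct >= threshold_pct:
            return minute
    return None


if __name__ == "__main__":
    # 100 GB FS, 10 GB used by other data, burst localizing 9 GB per minute.
    print(first_unhealthy_minute(100, 10, 9, 30))
    # Same FS with a slow, steady 0.5 GB per minute for an hour.
    print(first_unhealthy_minute(100, 10, 0.5, 60))
```

In this model the burst case crosses the threshold before the first cleanup pass has
a chance to run, while the slow case is kept in check by the periodic cleanup. Real
behaviour will differ, for example because resources still in use cannot be evicted.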

This issue may take many NMs offline at the same time and is thus quite critical
in terms of platform stability.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


