hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-257) NM should gracefully handle a full local disk
Date Wed, 05 Dec 2012 22:17:58 GMT

    [ https://issues.apache.org/jira/browse/YARN-257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510842#comment-13510842 ]

Jason Lowe commented on YARN-257:

bq. Before the complete change, would it help if the NM did not accept new containers. Maybe
by indicating in the heartbeat that do not assign containers to it.

Yes, it would be nice sometimes if a node could declare itself as being UNHEALTHY without
causing all containers currently running to be shot as it does now.  Sort of a "let's drain
the currently running containers but not allow any new ones" mode.
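
To make the "drain" idea concrete, here is a minimal sketch of a three-state node model. It is not YARN code: the DRAINING state, the NodeState enum, and both helper methods are hypothetical names made up for illustration. It only assumes, as described above, that UNHEALTHY currently means both "no new containers" and "kill what is running".

```java
public class DrainModeSketch {
    // Hypothetical states; only the UNHEALTHY behavior is taken from the discussion above.
    enum NodeState { HEALTHY, DRAINING, UNHEALTHY }

    // The scheduler would assign new containers only to HEALTHY nodes.
    static boolean acceptsNewContainers(NodeState state) {
        return state == NodeState.HEALTHY;
    }

    // Running containers are left alone unless the node is fully UNHEALTHY.
    static boolean keepsRunningContainers(NodeState state) {
        return state != NodeState.UNHEALTHY;
    }

    public static void main(String[] args) {
        for (NodeState state : NodeState.values()) {
            System.out.println(state
                + " acceptsNew=" + acceptsNewContainers(state)
                + " keepsRunning=" + keepsRunningContainers(state));
        }
    }
}
```

The point of the sketch is that DRAINING is the only state where the two answers differ: no new work is accepted, but existing work is allowed to finish.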

bq. Why does the RM not notice abnormal failure rates on such an NM and put it out of rotation
for scheduling?

Currently the RM doesn't track container failures on nodes for the purpose of blacklisting them.
AFAIK the RM only blacklists nodes that declare themselves UNHEALTHY via
the health checker script that they run.  The MR AM is already tracking such things, but I
don't believe there's a feedback mechanism from the AM to the RM to help the RM figure out
which nodes are bad from an AM's perspective.  Might be nice to have, and YARN-195 covers
this to some extent.

As you indicate, the RM could also check container failures solely via container status from
the NMs and blacklist NMs based on some algorithm.  We need to be careful that a misconfigured
large job doesn't end up blacklisting a large chunk of the cluster because all of its containers
fail.  Think bad parameters on mapreduce.map.java.opts, for example, or a case where it doesn't
get the classpath for its tasks correct.  And not all container failures from an AM's point
of view are visible to the RM watching container status.  The container could exit cleanly
but still fail at the app level, for example.  So we might need both mechanisms.
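
One possible guard against the misconfigured-job case is to count distinct failing applications per node instead of raw container failures. The sketch below is only an illustration of that idea: NodeFailureTracker, its methods, and the minDistinctApps threshold are hypothetical and do not exist in the RM.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class NodeFailureTracker {
    // Hypothetical threshold: how many different apps must fail on a node
    // before the node itself is suspected.
    private final int minDistinctApps;
    private final Map<String, Set<String>> failedAppsByNode = new HashMap<>();

    public NodeFailureTracker(int minDistinctApps) {
        this.minDistinctApps = minDistinctApps;
    }

    // Record that a container of appId failed on nodeId.
    public void recordFailure(String nodeId, String appId) {
        failedAppsByNode.computeIfAbsent(nodeId, k -> new HashSet<>()).add(appId);
    }

    // A node is a blacklist candidate only when failures span enough distinct
    // apps, so one broken job cannot take out the nodes it ran on by itself.
    public boolean shouldBlacklist(String nodeId) {
        Set<String> apps = failedAppsByNode.get(nodeId);
        return apps != null && apps.size() >= minDistinctApps;
    }
}
```

With a threshold of 2, a single job whose containers all fail on a node never trips the check, no matter how many of them fail. This only covers failures visible in container status; app-level failures would still need feedback from the AM.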

> NM should gracefully handle a full local disk
> ---------------------------------------------
>                 Key: YARN-257
>                 URL: https://issues.apache.org/jira/browse/YARN-257
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>    Affects Versions: 2.0.2-alpha, 0.23.5
>            Reporter: Jason Lowe
> When a local disk becomes full, the node will fail every container launched on it because
> the container is unable to localize.  It tries to create an app-specific directory in each
> of the local and log directories.  If any of those directory creations fails (due to lack
> of free space) the container fails.
> It would be nice if the node could continue to launch containers using the space available
> on other disks rather than failing all containers trying to launch on the node.
> This is somewhat related to YARN-91 but is centered around the disk becoming full rather
> than the disk failing.
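
A rough sketch of the behavior the description asks for: skip directories that lack free space and keep using the rest. This is not the NM's localization code. LocalDirSelector, dirsWithSpace, and the minFreeBytes cutoff are hypothetical; the only real API used is java.io.File.getUsableSpace().

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToLongFunction;

public class LocalDirSelector {
    // Return the directories whose reported free space is at least minFreeBytes.
    // The freeSpace probe is injectable so the logic can be exercised without real disks.
    static List<File> dirsWithSpace(List<File> dirs, long minFreeBytes,
                                    ToLongFunction<File> freeSpace) {
        List<File> usable = new ArrayList<>();
        for (File dir : dirs) {
            if (freeSpace.applyAsLong(dir) >= minFreeBytes) {
                usable.add(dir);
            }
        }
        return usable;
    }

    // Convenience overload that asks the filesystem for the usable space.
    static List<File> dirsWithSpace(List<File> dirs, long minFreeBytes) {
        return dirsWithSpace(dirs, minFreeBytes, File::getUsableSpace);
    }
}
```

Under this approach a container launch would fail only when the returned list is empty, rather than when any single disk is full.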
