hadoop-yarn-issues mailing list archives

From "Sunil G (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-257) NM should gracefully handle a full local disk
Date Tue, 04 Mar 2014 11:36:22 GMT

    [ https://issues.apache.org/jira/browse/YARN-257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13919295#comment-13919295 ]

Sunil G commented on YARN-257:

Maybe the NM can handle the disk-full scenario to some extent by itself in the first place.
NM's LocalDirAllocator returns a local path to write to from the "good" list of directories.
To choose that path, it uses a round-robin algorithm based on the space available.

Consider a scenario where several tasks ask for a path from the set of local directories.
Each allocation is based on the space available at that given time.
But the same path may already have been given to other tasks, and they may still be writing
to it sequentially.

Basically, the space already allotted is not considered when the next allocation from the
same path is given to another task.
[Assuming a few earlier-allocated tasks are still writing at this time]
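The effect can be shown with a little arithmetic. The class below is only an illustrative sketch, not NodeManager code: `StaleFreeSpaceDemo` and `shortfall` are hypothetical names, and it assumes every task is admitted against the same free-space reading because none of the earlier tasks has finished writing yet.

```java
/** Illustrative only: models admission checks that all see the same free-space snapshot. */
final class StaleFreeSpaceDemo {

    /**
     * Returns how many bytes will not fit if each task is admitted because its
     * planned write is no larger than the free space observed at allocation time,
     * while earlier admitted tasks have not yet consumed any space.
     */
    static long shortfall(long freeAtAllocation, long[] plannedWrites) {
        long admittedTotal = 0;
        for (long write : plannedWrites) {
            // Every check sees the same snapshot, so each task passes on its own.
            if (write <= freeAtAllocation) {
                admittedTotal += write;
            }
        }
        return Math.max(0, admittedTotal - freeAtAllocation);
    }
}
```

With 1000 units free and two tasks that each plan to write 800 units, both are admitted and the directory ends up 600 units short.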

But it is not possible to account for this earlier allotted space, and it is not possible
to predict the disk write speed.

Could it be possible to predict the disk-full scenario rather than acting only when it happens?
For example, the current health check mechanism checks access permissions etc. to identify
good and bad directories at a 2-minute interval.
Here, if the disk is almost full (say 95% used, or only 5*100 MB remaining), then it is better
to move that directory to the list of bad directories.
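A minimal sketch of such a space check follows. It is not the existing NodeManager health checker: the class `DiskSpaceCheck`, its method names, and the two constants are hypothetical, with the thresholds taken from the example figures above. It relies only on `java.io.File.getTotalSpace()` and `getUsableSpace()`.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: classifies directories as good or bad by remaining space. */
final class DiskSpaceCheck {
    // Assumed thresholds, mirroring "95% used or only 5*100 MB remaining".
    static final double MAX_USED_FRACTION = 0.95;
    static final long MIN_FREE_BYTES = 5L * 100L * 1024L * 1024L;

    /** Returns true when a volume with these sizes should be treated as nearly full. */
    static boolean isNearlyFull(long totalBytes, long usableBytes) {
        if (totalBytes <= 0) {
            return true; // size unknown or path missing: treat as unusable
        }
        double usedFraction = 1.0 - ((double) usableBytes / totalBytes);
        return usedFraction >= MAX_USED_FRACTION || usableBytes < MIN_FREE_BYTES;
    }

    /** Returns the subset of dirs that still have enough space, preserving order. */
    static List<String> filterGoodDirs(List<String> dirs) {
        List<String> good = new ArrayList<>();
        for (String dir : dirs) {
            File f = new File(dir);
            if (!isNearlyFull(f.getTotalSpace(), f.getUsableSpace())) {
                good.add(dir);
            }
        }
        return good;
    }
}
```

Running a check like this on the same periodic schedule as the permission check would let a directory drop out of the good list before writes start failing, though it still cannot account for tasks that are already writing.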

Alternatively, the LocalDirAllocator could check for a high percentage of disk used and not
assign such a directory to the task.
These measures might help keep new tasks from failing because of an immediate disk-full
scenario.
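The allocator-side variant could look roughly like the sketch below. This is not Hadoop's LocalDirAllocator: `ThresholdDirPicker` is a hypothetical class, the usage probe is passed in as a function so the example stays self-contained, and the selection is a plain round robin that skips directories at or above the threshold.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

/** Hypothetical round-robin picker that skips directories whose usage is too high. */
final class ThresholdDirPicker {
    private final List<String> dirs;
    private final ToDoubleFunction<String> usedFraction; // assumed probe: dir -> fraction used (0.0 to 1.0)
    private final double maxUsedFraction;
    private int next = 0;

    ThresholdDirPicker(List<String> dirs, ToDoubleFunction<String> usedFraction,
                       double maxUsedFraction) {
        this.dirs = new ArrayList<>(dirs);
        this.usedFraction = usedFraction;
        this.maxUsedFraction = maxUsedFraction;
    }

    /** Returns the next directory below the threshold in round-robin order, or null if none qualifies. */
    synchronized String pick() {
        for (int i = 0; i < dirs.size(); i++) {
            int idx = (next + i) % dirs.size();
            String dir = dirs.get(idx);
            if (usedFraction.applyAsDouble(dir) < maxUsedFraction) {
                next = (idx + 1) % dirs.size(); // resume after the chosen directory
                return dir;
            }
        }
        return null; // every directory is at or above the threshold
    }
}
```

A caller that gets `null` back can fail the request early with a clear message instead of letting the task fail mid-write.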

> NM should gracefully handle a full local disk
> ---------------------------------------------
>                 Key: YARN-257
>                 URL: https://issues.apache.org/jira/browse/YARN-257
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>    Affects Versions: 2.0.2-alpha, 0.23.5
>            Reporter: Jason Lowe
> When a local disk becomes full, the node will fail every container launched on it because
> the container is unable to localize.  It tries to create an app-specific directory for each
> local and log directories.  If any of those directory creates fail (due to lack of free space)
> the container fails.
> It would be nice if the node could continue to launch containers using the space available
> on other disks rather than failing all containers trying to launch on the node.
> This is somewhat related to YARN-91 but is centered around the disk becoming full rather
> than the disk failing.

This message was sent by Atlassian JIRA
