hadoop-yarn-issues mailing list archives

From "Ashwin Shankar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4011) Jobs fail since nm-local-dir not cleaned up when rogue job fills up disk
Date Mon, 03 Aug 2015 19:10:04 GMT

    [ https://issues.apache.org/jira/browse/YARN-4011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652293#comment-14652293 ]

Ashwin Shankar commented on YARN-4011:

hey [~jlowe], have you encountered this issue before at Yahoo? Also, would it make sense to
have a feature on the NM to limit the amount of data a
user/app can write to nm-local-dir, to protect other users? I'm looking into related JIRAs
like YARN-1781, which could be a band-aid for this problem.

> Jobs fail since nm-local-dir not cleaned up when rogue job fills up disk
> ------------------------------------------------------------------------
>                 Key: YARN-4011
>                 URL: https://issues.apache.org/jira/browse/YARN-4011
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>    Affects Versions: 2.4.0
>            Reporter: Ashwin Shankar
> We observed job failures because tasks couldn't launch on nodes due to "java.io.IOException:
No space left on device".
> On digging in further, we found a rogue job which had filled up the disk.
> Specifically, it wrote a lot of map spills (like attempt_1432082376223_461647_m_000421_0_spill_10000.out)
to nm-local-dir, causing the disk to fill up; the job then failed/got killed but didn't clean up its
files in nm-local-dir.
> So the disk remained full, causing subsequent jobs to fail.
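
For context on mitigations discussed here: the NodeManager's disk health checker already offers a coarse safeguard, though it is node-wide rather than the per-user/per-app limit asked about above. A hedged yarn-site.xml sketch (these property names are from yarn-default.xml; the values are illustrative, not from this thread):

```xml
<!-- Sketch only: once a local dir's utilization crosses this percentage,
     the disk health checker marks the dir "bad" and the NM stops placing
     new containers on it. This is not a per-user or per-app quota. -->
<property>
  <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
  <value>90.0</value>
</property>

<!-- Target size for the localized resource cache; the deletion service
     trims cached files toward this limit. Does not cover map spill files
     written by running tasks. -->
<property>
  <name>yarn.nodemanager.localizer.cache.target-size-mb</name>
  <value>10240</value>
</property>
```

Note that neither setting reclaims space left behind by a killed job's spill files, which is the cleanup gap this JIRA describes.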

This message was sent by Atlassian JIRA
