hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
Date Fri, 18 Sep 2015 14:53:05 GMT

    [ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14875721#comment-14875721 ]

Jason Lowe commented on YARN-2902:

bq. In the container localizer, when processing the HB DIE response, we send another localizer status
to the NM. Is it really required? What do you think?

I don't think this is required.  If the NM is telling the localizer to DIE then I don't think
the NM cares after that point what the localizer is doing.  The NM is totally done with it
at that point due to failure or lack of knowledge of that localizer.

bq. Or are you suggesting System exit?
I was suggesting System.exit if we aren't convinced that the localizer can actually
tear down in a timely manner.  For example, if the graceful shutdown could involve waiting
for active transfers to complete because we can't reliably interrupt them, then yes, I think
System.exit is appropriate.  A good compromise would be to put a timeout on shutdown: if
the localizer can't shut down within so many seconds, then have something (e.g., a watchdog thread if necessary)
call System.exit to get out.  Otherwise the localizer could still be running and modifying
the filesystem after the NM has tried to clean up.
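
To make the timeout idea concrete, here is a minimal sketch of such a watchdog in plain Java.  It is
not taken from any of the attached patches: the class name ShutdownWatchdog, the method arm, and the
injectable forcedExit action are all hypothetical, chosen only for illustration.  The caller arms the
watchdog before starting graceful shutdown and interrupts it once shutdown completes.

```java
/**
 * Hypothetical helper, not part of Hadoop: forces the process down if a
 * graceful shutdown does not complete within a deadline.
 */
final class ShutdownWatchdog {
  private ShutdownWatchdog() {}

  /**
   * Starts a daemon thread that runs forcedExit after timeoutMs unless the
   * returned thread is interrupted first (which disarms the watchdog).
   * In a real process forcedExit would be something like
   * () -> System.exit(1); it is a parameter here so the sketch can be
   * exercised without killing the JVM.
   */
  static Thread arm(long timeoutMs, Runnable forcedExit) {
    Thread watchdog = new Thread(() -> {
      try {
        Thread.sleep(timeoutMs);
      } catch (InterruptedException e) {
        return; // graceful shutdown finished in time, so do nothing
      }
      forcedExit.run(); // deadline passed: force the exit
    }, "shutdown-watchdog");
    // Daemon thread, so the watchdog itself never keeps the JVM alive.
    watchdog.setDaemon(true);
    watchdog.start();
    return watchdog;
  }
}
```

Usage would look roughly like: Thread w = ShutdownWatchdog.arm(10000, () -> System.exit(1)); then run
the graceful teardown and call w.interrupt() when it returns.  The 10 second value is only a placeholder.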

Worst-case scenario is this could still happen even with these fixes, but it should resolve
the leaking issue for the vast majority of cases.  We can make it more bulletproof in a followup
JIRA for 2.8 or later that actually has the NM tracking localizer pids and proactively killing
them if they don't respond in a timely manner to commands.

bq. However we can also let localizer not do any cleanup at all and let NM delete paths.
I would still like the localizer to try to perform some cleanup if possible, as the NM doesn't
track localizers in the state store.  Therefore, if the NM restarts, we may not clean up everything
properly if the localizer doesn't do it on its own.

> Killing a container that is localizing can orphan resources in the DOWNLOADING state
> ------------------------------------------------------------------------------------
>                 Key: YARN-2902
>                 URL: https://issues.apache.org/jira/browse/YARN-2902
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>            Assignee: Varun Saxena
>         Attachments: YARN-2902.002.patch, YARN-2902.03.patch, YARN-2902.04.patch, YARN-2902.05.patch,
> YARN-2902.06.patch, YARN-2902.patch
>
> If a container is in the process of localizing when it is stopped/killed, then resources
> are left in the DOWNLOADING state.  If no other container comes along and requests these resources,
> they linger with no reference counts but aren't cleaned up during normal cache cleanup
> scans, because the scan never deletes resources in the DOWNLOADING state even if their reference
> count is zero.
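
The leak described in the issue summary can be illustrated with a small, self-contained model.  This is
not the NodeManager's actual cleanup code: ResourceState, CachedResource, and CacheCleaner are hypothetical
names, and the eviction rule is only an assumption that mirrors the behavior described above (a resource
is evicted only when it is fully localized and has a zero reference count).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a cached resource; not a Hadoop class.
enum ResourceState { DOWNLOADING, LOCALIZED }

final class CachedResource {
  final String path;
  final ResourceState state;
  final int refCount;

  CachedResource(String path, ResourceState state, int refCount) {
    this.path = path;
    this.state = state;
    this.refCount = refCount;
  }
}

final class CacheCleaner {
  private CacheCleaner() {}

  /**
   * Returns the paths a cleanup scan would evict under the assumed rule:
   * zero references AND state LOCALIZED.  A DOWNLOADING resource is
   * skipped even when nothing references it, so it is never reclaimed.
   */
  static List<String> evictable(List<CachedResource> cache) {
    List<String> paths = new ArrayList<>();
    for (CachedResource r : cache) {
      if (r.refCount == 0 && r.state == ResourceState.LOCALIZED) {
        paths.add(r.path);
      }
    }
    return paths;
  }
}
```

In this model, a resource whose container was killed mid-download ends up as DOWNLOADING with a reference
count of zero, and no number of scans will ever return it from evictable.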

This message was sent by Atlassian JIRA