hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
Date Tue, 15 Sep 2015 22:34:47 GMT

    [ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14746398#comment-14746398 ]

Jason Lowe commented on YARN-2902:

Wow, the patch has gotten a lot larger.  Will take me a while to fully digest all of the changes.

bq. Would we want this to go into 2.7.2 ?
Ideally yes.  We're seeing significant leakage of local resources on many of our nodes due
to this problem.

Back to the patch itself, at a quick glance I'm lukewarm on a couple of points.  I'm not thrilled
with adding yet another config that users need to tune properly or it doesn't work correctly.
 Container localizers should already have the concept of heartbeating and killing themselves
if they don't hear from the NM within X seconds, and likewise the NM should kill localizers
that don't heartbeat in a timely fashion.  It seems to me that's the time interval we should
be using to determine how long to wait before giving up and having the NM do the cleanup.
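
To make the idea concrete, here is a minimal sketch of using the existing heartbeat timeout as the point at which the NM gives up on a localizer and does the cleanup itself. The names (LocalizerHeartbeatTracker, heartbeat, expire) are hypothetical and are not taken from the NodeManager code or from the patch; the clock is passed in explicitly so the behavior is easy to follow.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: one timeout drives both "localizer is dead" and
// "NM takes over cleanup", so no additional config knob is needed.
final class LocalizerHeartbeatTracker {
    private final long timeoutMs;
    private final Map<String, Long> lastHeartbeatMs = new HashMap<>();

    LocalizerHeartbeatTracker(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    // Record that the given localizer was heard from at time nowMs.
    synchronized void heartbeat(String localizerId, long nowMs) {
        lastHeartbeatMs.put(localizerId, nowMs);
    }

    // Return (and forget) every localizer that has been silent for longer
    // than the timeout. The caller would kill these and schedule cleanup.
    synchronized List<String> expire(long nowMs) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = lastHeartbeatMs.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (nowMs - e.getValue() > timeoutMs) {
                expired.add(e.getKey());
                it.remove();
            }
        }
        return expired;
    }
}
```

A periodic NM task could call expire() and, for each returned id, kill the localizer and schedule deletion of its paths.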

I'm also not sure we need deletion task cancellation.  As you point out it's not really necessary.
 The files being deleted should not be reused later, so there should be no harm in attempting
to redundantly delete them.  If we decide that we really should have cancellation of deletion
tasks then that should be implemented as a separate JIRA to help keep this patch more manageable
since that's a readily separable feature that can stand on its own.

Also do we really need a flag to say whether we want it to ignore missing paths?  Wondering
if we should just ignore cases where the path doesn't exist.  I can see it either way I guess.
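
For reference, the two behaviors being weighed here map directly onto the JDK's Files.delete and Files.deleteIfExists. The sketch below is only an illustration with a hypothetical TolerantDelete helper; it is not the deletion code in the patch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper showing what an "ignore missing paths" flag would toggle.
final class TolerantDelete {
    // Returns true if something was deleted, false if the path was already gone.
    // With ignoreMissing == false a missing path raises NoSuchFileException.
    static boolean delete(Path p, boolean ignoreMissing) throws IOException {
        if (ignoreMissing) {
            return Files.deleteIfExists(p);
        }
        Files.delete(p);
        return true;
    }
}
```

Always taking the ignoreMissing branch is what makes a redundant delete of the same path harmless.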

I wonder if we can solve this with a simpler approach that should work well in most cases.
 What if we have the localizer register the temporary working directory (i.e., the _tmp paths)
as deleteOnExit paths?  Then the localizer should try to clean these up if there's anything
at all resembling a "normal" JVM exit, and I believe the JVM ignores cases where the file is
already missing.  Then we only have to worry about the case where the localizer dies a horrible
death and the JVM doesn't get a chance to clean up.  To cover that case, when the NM kills
the localizer we also schedule a deletion of the dest and dest_tmp paths.  With this I don't
think we need to change the localizer protocol: DIE means try to clean up, but the NM will always
clean up anyway, so there's no need to wait around and try too hard.  It's actually more important
that the localizer gets out of the way in a timely manner than that it cleans up, since the
NM will be the backup in case the localizer fails.
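
A rough sketch of the localizer side of this, using only the JDK. Everything here is an assumption for illustration: TmpPathCleanup is a hypothetical class, and it uses a shutdown hook with a recursive delete because java.io.File.deleteOnExit only removes a directory if it is empty. The real localizer might use Hadoop's own deleteOnExit support instead.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: best-effort removal of _tmp paths on a normal JVM exit.
final class TmpPathCleanup {
    private static final Set<Path> PENDING = new LinkedHashSet<>();
    private static boolean hookInstalled = false;

    // Register a temporary path; the shutdown hook is installed on first use.
    static synchronized void deleteOnExit(Path tmp) {
        PENDING.add(tmp);
        if (!hookInstalled) {
            Runtime.getRuntime().addShutdownHook(new Thread(TmpPathCleanup::cleanupAll));
            hookInstalled = true;
        }
    }

    // Delete everything registered so far. Failures are swallowed because the
    // NM-side deletion is the backup; the localizer should just get out of the way.
    static void cleanupAll() {
        List<Path> snapshot;
        synchronized (TmpPathCleanup.class) {
            snapshot = new ArrayList<>(PENDING);
            PENDING.clear();
        }
        for (Path p : snapshot) {
            try {
                deleteRecursively(p);
            } catch (IOException e) {
                // Best effort only.
            }
        }
    }

    // Recursive delete that treats an already-missing path as success.
    static void deleteRecursively(Path root) throws IOException {
        try {
            Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path f, BasicFileAttributes a) throws IOException {
                    Files.deleteIfExists(f);
                    return FileVisitResult.CONTINUE;
                }

                @Override
                public FileVisitResult postVisitDirectory(Path d, IOException e) throws IOException {
                    if (e != null) {
                        throw e;
                    }
                    Files.deleteIfExists(d);
                    return FileVisitResult.CONTINUE;
                }
            });
        } catch (NoSuchFileException e) {
            // Path vanished (or never existed): nothing left to do.
        }
    }
}
```

The hook does not run if the JVM is killed hard (for example with SIGKILL), which is exactly the case the NM-scheduled deletion of dest and dest_tmp is meant to cover.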

> Killing a container that is localizing can orphan resources in the DOWNLOADING state
> ------------------------------------------------------------------------------------
>                 Key: YARN-2902
>                 URL: https://issues.apache.org/jira/browse/YARN-2902
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>            Assignee: Varun Saxena
>         Attachments: YARN-2902.002.patch, YARN-2902.03.patch, YARN-2902.04.patch, YARN-2902.05.patch,
YARN-2902.06.patch, YARN-2902.patch
> If a container is in the process of localizing when it is stopped/killed then resources
are left in the DOWNLOADING state.  If no other container comes along and requests these resources
they linger around with no reference counts but aren't cleaned up during normal cache cleanup
scans, since those scans never delete resources in the DOWNLOADING state even if their reference
count is zero.

This message was sent by Atlassian JIRA