hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
Date Wed, 07 Jan 2015 22:33:34 GMT

    [ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268403#comment-14268403 ]

Jason Lowe commented on YARN-2902:

Thanks for the patch, Varun!

I think the patch will prevent us from leaking the bookkeeping in resource trackers for resources
in the downloading state, but it relies on the periodic retention checking and doesn't address
the leaking of data on the disk.  The localizer has probably created a partial *_tmp file/dir
for the download that didn't complete, and we should be cleaning that up as well.  As is, we
won't try to clean up any leaked DOWNLOADING resource until the retention process runs (on
the order of tens of minutes), but we shouldn't need to wait around to reap resources that
aren't really downloading.

I haven't had time to work this all the way through, but I'm wondering if we're patching the
symptoms rather than the root cause.  The resource is lingering around in the DOWNLOADING
state because a container was killed and we then "forgot" the corresponding localizer that
was associated with the container. When the localizer later heartbeats in, the NM tells the
unknown localizer to DIE, and that ultimately is what leads to a resource lingering around
in the DOWNLOADING state.  I think we should be properly cleaning up localizers corresponding
to killed containers and sending appropriate events to the LocalizedResources.  This will
then cause the resources to transition out of the DOWNLOADING state to something appropriate,
sending the proper events to any other containers that are pending on that resource.  At that
point we can also clean up any leaked _tmp files/dirs from the failed/killed localizer.
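
To make the proposal above a bit more concrete, here is a rough, self-contained Java sketch of
the cleanup path. It is not NodeManager code: the class, the State enum, startDownload, waitFor,
onContainerKilled, and the choice of FAILED as the target state are all hypothetical names picked
for illustration, and the _tmp deletion is only recorded in a list rather than performed on disk.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the proposed behavior; none of these types are real YARN classes.
class LocalizerCleanupSketch {
    enum State { INIT, DOWNLOADING, LOCALIZED, FAILED }

    static final class Resource {
        final String name;
        State state = State.INIT;
        // Other containers pending on this resource (assumed bookkeeping).
        final List<String> waiters = new ArrayList<>();
        Resource(String name) { this.name = name; }
    }

    // containerId -> resources that container's localizer is currently downloading
    private final Map<String, List<Resource>> inFlight = new HashMap<>();
    final List<String> events = new ArrayList<>();
    final List<String> deletedTmpPaths = new ArrayList<>();

    void startDownload(String containerId, Resource r) {
        r.state = State.DOWNLOADING;
        inFlight.computeIfAbsent(containerId, k -> new ArrayList<>()).add(r);
    }

    void waitFor(String containerId, Resource r) {
        r.waiters.add(containerId);
    }

    // On kill, tear down the localizer's bookkeeping right away instead of forgetting it.
    void onContainerKilled(String containerId) {
        List<Resource> downloads = inFlight.remove(containerId);
        if (downloads == null) {
            return; // the container was not localizing anything
        }
        for (Resource r : downloads) {
            r.state = State.FAILED; // move out of DOWNLOADING (target state is an assumption)
            for (String waiter : r.waiters) {
                events.add(waiter + " <- FAILED " + r.name); // tell pending containers
            }
            deletedTmpPaths.add(r.name + "_tmp"); // stand-in for deleting the partial download
        }
    }

    public static void main(String[] args) {
        LocalizerCleanupSketch nm = new LocalizerCleanupSketch();
        Resource jar = new Resource("job.jar");
        nm.startDownload("container_1", jar);
        nm.waitFor("container_2", jar);
        nm.onContainerKilled("container_1");
        System.out.println(jar.state + " " + nm.events + " " + nm.deletedTmpPaths);
    }
}
```

The point of the sketch is only the ordering: the kill handler itself drives the state transition,
the notifications, and the _tmp cleanup, so nothing is left for the retention check to find later.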

> Killing a container that is localizing can orphan resources in the DOWNLOADING state
> ------------------------------------------------------------------------------------
>                 Key: YARN-2902
>                 URL: https://issues.apache.org/jira/browse/YARN-2902
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>            Assignee: Varun Saxena
>             Fix For: 2.7.0
>         Attachments: YARN-2902.002.patch, YARN-2902.patch
> If a container is in the process of localizing when it is stopped/killed, then resources
are left in the DOWNLOADING state.  If no other container comes along and requests these resources,
they linger around with no reference counts but aren't cleaned up during normal cache cleanup
scans, since those scans never delete resources in the DOWNLOADING state even if their reference
count is zero.
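
The scan behavior described above can be illustrated with a small Java sketch. It is a simplified
model, not the actual cache cleanup code: CacheScanSketch, Entry, and selectForDeletion are
hypothetical names, and the eligibility rule is an assumption based only on the description.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a cache cleanup scan; not real YARN code.
class CacheScanSketch {
    enum State { DOWNLOADING, LOCALIZED }

    static final class Entry {
        final String name;
        final State state;
        final int refCount;
        Entry(String name, State state, int refCount) {
            this.name = name;
            this.state = state;
            this.refCount = refCount;
        }
    }

    // Assumed rule: only unreferenced, fully localized entries are eligible.
    // An entry stuck in DOWNLOADING with refCount == 0 is therefore never selected.
    static List<String> selectForDeletion(List<Entry> cache) {
        List<String> doomed = new ArrayList<>();
        for (Entry e : cache) {
            if (e.refCount == 0 && e.state == State.LOCALIZED) {
                doomed.add(e.name);
            }
        }
        return doomed;
    }
}
```

Under this rule an orphaned entry such as ("b.jar", DOWNLOADING, 0) survives every scan, which is
the leak reported here.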

