hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
Date Tue, 23 Jun 2015 21:21:44 GMT

    [ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598384#comment-14598384 ]

Jason Lowe commented on YARN-2902:
----------------------------------

Thanks for updating the patch, Varun!

Is one second enough time for the localizer to tear down if the system is heavily loaded,
disks are slow, etc.?  I think it would be better for the executor to let us know when a localizer
has completed rather than assuming 1 second will be enough time (or too much time).  We can
tackle this in a followup JIRA since it's a more significant change, as I'm not sure executors
are tracking localizers today.
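
As a rough sketch of what such a followup could look like: the executor completes a future when a localizer process exits, and the cleanup path waits on that future with an upper bound instead of sleeping for a fixed second. Nothing here is existing NM code; `LocalizerTracker`, `register`, `localizerExited` and `awaitExit` are hypothetical names chosen for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of the NodeManager today.
class LocalizerTracker {
  private final ConcurrentHashMap<String, CompletableFuture<Void>> running =
      new ConcurrentHashMap<>();

  // Assumed to be called when the executor launches a localizer.
  void register(String localizerId) {
    running.putIfAbsent(localizerId, new CompletableFuture<>());
  }

  // Assumed to be called by the executor once the localizer process has exited.
  void localizerExited(String localizerId) {
    CompletableFuture<Void> done = running.remove(localizerId);
    if (done != null) {
      done.complete(null);
    }
  }

  // Returns true if the localizer is known to have exited within the timeout.
  boolean awaitExit(String localizerId, long timeoutMs) {
    CompletableFuture<Void> done = running.get(localizerId);
    if (done == null) {
      return true; // never registered, or already exited
    }
    try {
      done.get(timeoutMs, TimeUnit.MILLISECONDS);
      return true;
    } catch (TimeoutException e) {
      return false; // still running; the caller decides whether to retry
    } catch (ExecutionException e) {
      return true; // completed exceptionally, so the localizer is gone
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }
}
```

With something along these lines, the deletion of the localizer's working directory would be scheduled only after `awaitExit` returns true, so a slow teardown delays the cleanup rather than racing with it.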

There are a number of sleeps in the unit test which we should try to avoid if possible.  Is
there a reason dispatcher.await() isn't sufficient to avoid the races?  At a minimum there
should be a comment for each one explaining what we're trying to avoid by sleeping.
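
If dispatcher.await() turns out not to cover a particular race, one alternative to a bare sleep is a bounded poll on the condition the test actually cares about. The helper below is only a plain-Java sketch under that assumption; `TestWait` and `waitFor` are hypothetical names, not existing test utilities in the patch.

```java
import java.util.function.BooleanSupplier;

// Hypothetical test helper: poll a condition instead of sleeping a fixed time.
class TestWait {
  // Returns true as soon as the condition holds, false if the timeout expires first.
  static boolean waitFor(BooleanSupplier condition, long pollMs, long timeoutMs) {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (true) {
      if (condition.getAsBoolean()) {
        return true;
      }
      if (System.nanoTime() - deadline >= 0) {
        return false;
      }
      try {
        Thread.sleep(pollMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
  }
}
```

A test would then assert on the result, which also documents what each wait is for.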

Nit: I've always interpreted the debug delay as extra time, for debugging purposes, that elapses
just before the NM deletes a file.  To be consistent it seems that we should be adding the debug
delay to any requested delay.  That way the NM will always preserve a file for debugDelay seconds
_beyond_ what an NM with debugDelay=0 seconds would do.
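
In other words, the effective delay would be the sum of the two values rather than one replacing the other. A minimal sketch, assuming both delays are expressed in seconds; `DeletionDelay` and `effectiveDelaySeconds` are illustrative names, not DeletionService API:

```java
// Hypothetical helper showing the proposed arithmetic only.
final class DeletionDelay {
  private DeletionDelay() {}

  // The debug delay is added on top of whatever delay the caller requested.
  static long effectiveDelaySeconds(long requestedDelaySec, long debugDelaySec) {
    return requestedDelaySec + debugDelaySec;
  }
}
```

For example, a requested delay of 1 second with debugDelay=600 would keep the file for 601 seconds, and with debugDelay=0 the behavior is unchanged.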

Nit: The TODO in DeletionService about parent being owned by NM, etc. probably only needs
to be in the delete method that actually does the work rather than duplicated in veneer methods.

Nit: Should "Container killed while downloading" be "Container killed while localizing"? 
We use localizing elsewhere (e.g.: NM log UI when trying to get logs of a container that is
still localizing).

Nit: "Inorrect path for PRIVATE localization." should be "Incorrect path for PRIVATE localization: "
to fix the typo and add a trailing space before the subsequent filename.  The next log message
is missing a trailing space as well.  I realize this is a pre-existing bug, but it would be nice
to fix it as part of moving the code.



> Killing a container that is localizing can orphan resources in the DOWNLOADING state
> ------------------------------------------------------------------------------------
>
>                 Key: YARN-2902
>                 URL: https://issues.apache.org/jira/browse/YARN-2902
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>            Assignee: Varun Saxena
>         Attachments: YARN-2902.002.patch, YARN-2902.03.patch, YARN-2902.04.patch, YARN-2902.patch
>
>
> If a container is in the process of localizing when it is stopped/killed then resources
> are left in the DOWNLOADING state.  If no other container comes along and requests these
> resources they linger around with no reference counts but aren't cleaned up during normal
> cache cleanup scans, since the scan never deletes resources in the DOWNLOADING state even
> if their reference count is zero.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
