hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
Date Mon, 26 Oct 2015 21:17:27 GMT

    [ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14975095#comment-14975095
] 

Jason Lowe commented on YARN-2902:
----------------------------------

Thanks for updating the patch, Varun!

I'm not sure the logic in handleDownloadingRsrcsOnCleanup is desired.  IIUC when a container
is killed then all resources that were still pending will receive a resource failed event.
 Those resources will in turn send all containers that reference it a failed resource message,
and that will effectively kill the container.  This could be bad if the killed container was
localizing a resource for more than one container.  Today when the container is killed I believe
another outstanding container will simply acquire the lock on the scheduled resource and proceed
to re-attempt a localization on that resource.  If we do this logic then we'll kill all those
other referencing containers which isn't correct.  We do need to cleanup downloading resource
objects that have a zero refcount, but not ones that are still referenced by other containers.
 Those other containers should be able to continue localizing the resource if necessary.

With the following code, don't we risk failing to put the resource on the scheduled event
queue if something is thrown yet we will still hold the resource lock?  In the old code we
would always put the resource on the scheduled queue before fielding any errors, and the scheduled
queue processing would release the resource lock.
{code}
          if (nRsrc.tryAcquire()) {
            if (nRsrc.getState() == ResourceState.DOWNLOADING) {
              [...some code removed for brevity...]
              try {
                Path localizationPath = getPathForLocalization(next);
                synchronized (scheduled) {
                  scheduled.put(nextRsrc,
                      new ScheduledResource(evt, localizationPath));
                }
                return NodeManagerBuilderUtils.newResourceLocalizationSpec(next,
                    localizationPath);
              } catch (IOException e) {
                LOG.error("local path for PRIVATE localization could not be " +
                    "found. Disks might have failed.", e);
              } catch (IllegalArgumentException e) {
                LOG.error("Inorrect path for PRIVATE localization."
                    + next.getResource().getFile(), e);
              } catch (URISyntaxException e) {
                LOG.error(
                    "Got exception in parsing URL of LocalResource:"
                        + next.getResource(), e);
              }
            } else {
              // Need to release acquired lock
              nRsrc.unlock();
            }
          }
{code}

Actually do we even need all the code refactoring associated with the new ScheduledResource
class?  I think we can simply use the local path already accessible via event.getResource().getLocalPath()
in the cleanup code.  If that's not null then that means we sent the localizer that resource
in the heartbeat response, and those would be the ones to target for cleanup.

> Killing a container that is localizing can orphan resources in the DOWNLOADING state
> ------------------------------------------------------------------------------------
>
>                 Key: YARN-2902
>                 URL: https://issues.apache.org/jira/browse/YARN-2902
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>            Assignee: Varun Saxena
>         Attachments: YARN-2902.002.patch, YARN-2902.03.patch, YARN-2902.04.patch, YARN-2902.05.patch,
YARN-2902.06.patch, YARN-2902.07.patch, YARN-2902.08.patch, YARN-2902.patch
>
>
> If a container is in the process of localizing when it is stopped/killed then resources
are left in the DOWNLOADING state.  If no other container comes along and requests these resources
they linger around with no reference counts but aren't cleaned up during normal cache cleanup
scans since it will never delete resources in the DOWNLOADING state even if their reference
count is zero.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message