hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (YARN-1338) Recover localized resource cache state upon nodemanager restart
Date Wed, 28 May 2014 16:33:03 GMT

     [ https://issues.apache.org/jira/browse/YARN-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated YARN-1338:

    Attachment: YARN-1338v6.patch

Thanks for the additional comments, Junping.

bq. Do we have any code to destroy DB items for NMState when NM is decommissioned (not expecting
short-term restart)?

Good point.  I added shutdown code that removes the recovery directory if the shutdown is
due to a decommission.  I also added a unit test for this scenario.
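
As a rough illustration of the decommission cleanup described above, here is a minimal standalone sketch. It is not the code from the patch: the class name RecoveryCleanupSketch, the method cleanupOnShutdown, and the boolean decommissioned flag are all hypothetical, and it uses java.nio rather than Hadoop's filesystem APIs.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class RecoveryCleanupSketch {

    /**
     * Hypothetical helper: removes the recovery directory only when the
     * shutdown is caused by a decommission. Returns true if it was removed.
     */
    static boolean cleanupOnShutdown(Path recoveryDir, boolean decommissioned)
            throws IOException {
        if (!decommissioned || !Files.exists(recoveryDir)) {
            // Ordinary shutdown or restart: keep the state so it can be recovered.
            return false;
        }
        List<Path> entries;
        try (Stream<Path> walk = Files.walk(recoveryDir)) {
            // Reverse order puts children before their parent directories.
            entries = walk.sorted(Comparator.reverseOrder())
                          .collect(Collectors.toList());
        }
        for (Path entry : entries) {
            Files.delete(entry);
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("nm-recovery");
        Files.createFile(dir.resolve("state.db"));
        System.out.println("removed on restart: " + cleanupOnShutdown(dir, false));
        System.out.println("removed on decommission: " + cleanupOnShutdown(dir, true));
    }
}
```

The point of the flag is that a short-term restart must leave the recovery state in place, while a decommissioned node has no use for it.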

In LocalResourcesTrackerImpl#recoverResource()

+    incrementFileCountForLocalCacheDirectory(localDir.getParent());

Given localDir is already the parent of localPath, maybe we should just increment localDir
rather than its parent? I didn't see a unit test that checks the file count for the resource
directory after recovery. Maybe we should add one?

The last component of localDir is the unique resource ID and not a directory managed by the
local cache directory manager.  The directory allocated by the local cache directory manager
has an additional directory added by the localization process which is named after the unique
ID for the local resource.  For example, the localPath might be something like /local/root/0/1/52/resource.jar
and localDir is /local/root/0/1/52.  The '52' is the unique resource ID (always >= 10 so
it can't conflict with single-character cache mgr subdirs) and /local/root/0/1 is the directory
managed by the local dir cache manager.  If we passed localDir to the local dir cache manager,
it would get confused: it would try to parse the last component ('52') as one of the
subdirectories it created, which it is not.
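
To make the path relationship concrete, here is a small standalone sketch using the example path above. It is not the NodeManager code: the class CacheDirSketch and its methods are hypothetical names, and it uses java.nio.file.Path instead of Hadoop's Path class.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class CacheDirSketch {

    /**
     * Hypothetical helper: given the full path of a localized file, returns
     * the directory that the cache directory manager actually tracks.
     */
    static Path managedCacheDir(Path localPath) {
        Path localDir = localPath.getParent(); // e.g. /local/root/0/1/52
        return localDir.getParent();           // e.g. /local/root/0/1
    }

    /**
     * Resource IDs are always >= 10, so their directory names have at least
     * two characters, while cache manager subdirectories are one character.
     */
    static boolean isResourceIdComponent(Path dir) {
        return dir.getFileName().toString().length() >= 2;
    }

    public static void main(String[] args) {
        Path localPath = Paths.get("/local/root/0/1/52/resource.jar");
        Path localDir = localPath.getParent();
        System.out.println("localDir:    " + localDir);
        System.out.println("managed dir: " + managedCacheDir(localPath));
        System.out.println("last component is a resource ID: "
                + isResourceIdComponent(localDir));
    }
}
```

This is why the recovery code increments the count for localDir.getParent(): that is the directory the cache directory manager handed out, while localDir itself is the per-resource directory added during localization.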

I did add a unit test to verify local cache directory counts are incremented properly when
resources are recovered.  This required exposing a couple of methods as package-private to
get the necessary information for the test.

> Recover localized resource cache state upon nodemanager restart
> ---------------------------------------------------------------
>                 Key: YARN-1338
>                 URL: https://issues.apache.org/jira/browse/YARN-1338
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>         Attachments: YARN-1338.patch, YARN-1338v2.patch, YARN-1338v3-and-YARN-1987.patch, YARN-1338v4.patch, YARN-1338v5.patch, YARN-1338v6.patch
> Today when the node manager restarts, we clean up all the distributed cache files from disk. This is not ideal for two reasons:
> * For a work-preserving restart, we definitely want them, as running containers are using them.
> * Even for a non-work-preserving restart, keeping them is useful because we don't have to download them again if future tasks need them.

This message was sent by Atlassian JIRA
