hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (YARN-1338) Recover localized resource cache state upon nodemanager restart
Date Wed, 05 Mar 2014 22:06:48 GMT

     [ https://issues.apache.org/jira/browse/YARN-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated YARN-1338:

    Attachment: YARN-1338.patch

Patch to recover the localized resource cache state when NM recovery is enabled.

There is a state store interface with two main implementations: a null store when recovery
is not enabled and a leveldb store when it is.
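
As a rough illustration of that split, here is a minimal Java sketch. All names in it (ResourceStateStore, NullStateStore, MapBackedStateStore, StateStores) are hypothetical and are not taken from the patch. An in-memory map stands in for leveldb so the example has no external dependencies.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface; the names and methods in the actual patch may differ.
interface ResourceStateStore {
    void recordLocalized(String resourceKey, String localPath);
    void recordRemoved(String resourceKey);
    Map<String, String> loadLocalized();
}

// Used when recovery is disabled: every call is a no-op and nothing is recovered.
final class NullStateStore implements ResourceStateStore {
    @Override public void recordLocalized(String resourceKey, String localPath) { }
    @Override public void recordRemoved(String resourceKey) { }
    @Override public Map<String, String> loadLocalized() { return Collections.emptyMap(); }
}

// Used when recovery is enabled. The backing map is an assumption standing in
// for a leveldb database that outlives the NM process.
final class MapBackedStateStore implements ResourceStateStore {
    private final Map<String, String> persisted;

    MapBackedStateStore(Map<String, String> persisted) {
        this.persisted = persisted;
    }

    @Override public void recordLocalized(String resourceKey, String localPath) {
        persisted.put(resourceKey, localPath);
    }

    @Override public void recordRemoved(String resourceKey) {
        persisted.remove(resourceKey);
    }

    @Override public Map<String, String> loadLocalized() {
        return new HashMap<>(persisted); // defensive copy for the caller
    }
}

final class StateStores {
    private StateStores() { }

    // Picks the implementation based on whether NM recovery is enabled.
    static ResourceStateStore create(boolean recoveryEnabled, Map<String, String> backing) {
        return recoveryEnabled ? new MapBackedStateStore(backing) : new NullStateStore();
    }
}
```

With this shape the localization code can call the store unconditionally; whether anything is actually written depends only on which implementation was created at startup.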

Note that resource reference counts are not explicitly persisted.  When containers are recovered
(YARN-1337), the recovered containers will re-request their resources, which restores the
correct reference counts.  Even if we don't recover containers this is still useful, since it
lets an NM remember which resources have already been localized across a restart.
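
A small sketch of how the counts could rebuild themselves, under the assumption that recovered entries start at zero references and each re-request increments the count. RecoveredCacheSketch and its methods are hypothetical names, not the NM's actual classes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a resource cache rebuilt from persisted state.
final class RecoveredCacheSketch {
    private final Map<String, String> localPaths = new HashMap<>();
    private final Map<String, Integer> refCounts = new HashMap<>();

    // Only "which resources are localized" comes back from the store;
    // every recovered entry starts with a reference count of zero.
    RecoveredCacheSketch(Map<String, String> recovered) {
        for (Map.Entry<String, String> e : recovered.entrySet()) {
            localPaths.put(e.getKey(), e.getValue());
            refCounts.put(e.getKey(), 0);
        }
    }

    // A container (recovered or new) asks for a resource. Returns the local
    // path on a cache hit, or null if the resource would need to be downloaded.
    String request(String resourceKey) {
        String path = localPaths.get(resourceKey);
        if (path != null) {
            refCounts.merge(resourceKey, 1, Integer::sum);
        }
        return path;
    }

    // A container finishes with a resource; the count never drops below zero.
    void release(String resourceKey) {
        refCounts.computeIfPresent(resourceKey, (k, c) -> Math.max(0, c - 1));
    }

    int refCount(String resourceKey) {
        return refCounts.getOrDefault(resourceKey, 0);
    }
}
```

If two recovered containers each re-request the same jar, its count ends up at 2 without the count ever having been written to the store.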

One last thing that isn't persisted in the current patch is the resource reference timestamp
used for LRU sorting during a cache cleanup.  It's not needed for correctness but would be
nice to persist so we don't end up purging a recently used resource after an NM recovery.
We could add that in a follow-up JIRA or I could update it as part of this one.
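
To make the LRU concern concrete, here is a simplified sketch of a timestamp-ordered cleanup. It assumes that only unreferenced resources are eligible and that the oldest reference timestamp is evicted first. LruCleanupSketch, Entry and selectForEviction are hypothetical names and do not reflect the NM's actual cleanup code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class LruCleanupSketch {
    // Hypothetical view of one cached resource.
    static final class Entry {
        final String key;
        final long sizeBytes;
        final long lastRefMillis; // the timestamp that is not persisted today
        final int refCount;

        Entry(String key, long sizeBytes, long lastRefMillis, int refCount) {
            this.key = key;
            this.sizeBytes = sizeBytes;
            this.lastRefMillis = lastRefMillis;
            this.refCount = refCount;
        }
    }

    private LruCleanupSketch() { }

    // Returns the keys to delete, least recently referenced first, until at
    // least bytesToFree bytes are covered. Entries still in use are skipped.
    static List<String> selectForEviction(List<Entry> entries, long bytesToFree) {
        List<Entry> candidates = new ArrayList<>();
        for (Entry e : entries) {
            if (e.refCount == 0) {
                candidates.add(e);
            }
        }
        candidates.sort(Comparator.comparingLong((Entry e) -> e.lastRefMillis));

        List<String> evicted = new ArrayList<>();
        long freed = 0;
        for (Entry e : candidates) {
            if (freed >= bytesToFree) {
                break;
            }
            evicted.add(e.key);
            freed += e.sizeBytes;
        }
        return evicted;
    }
}
```

If every recovered entry comes back with the same timestamp, for example the recovery time, the sort can no longer tell old resources from recently used ones, which is the situation persisting the timestamp would avoid.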

> Recover localized resource cache state upon nodemanager restart
> ---------------------------------------------------------------
>                 Key: YARN-1338
>                 URL: https://issues.apache.org/jira/browse/YARN-1338
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>         Attachments: YARN-1338.patch
> Today when node manager restarts we clean up all the distributed cache files from disk.
> This is definitely not ideal from 2 aspects.
> * For work preserving restart we definitely want them as running containers are using
> them.
> * Even for non work preserving restart this will be useful in the sense that we don't
> have to download them again if needed by future tasks.

This message was sent by Atlassian JIRA