hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (YARN-1338) Recover localized resource cache state upon nodemanager restart
Date Wed, 05 Mar 2014 22:06:48 GMT

     [ https://issues.apache.org/jira/browse/YARN-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated YARN-1338:
-----------------------------

    Attachment: YARN-1338.patch

Patch to recover the localized resource cache state when NM recovery is enabled.

The patch adds a state store interface with two main implementations: a null store used when
recovery is not enabled and a leveldb store used when it is.
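
As a rough illustration of that split (not the code in the attached patch), the sketch below shows one
interface with a no-op implementation and a recording implementation. Every name here
(LocalResourceStateStore, NullStateStore, MapBackedStateStore, StateStoreFactory) is hypothetical, and a
sorted in-memory map stands in for leveldb, so nothing in this sketch actually survives a process restart.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical interface; the names and methods are assumptions, not the patch's API.
interface LocalResourceStateStore {
    boolean canRecover();
    void storeLocalizedResource(String remoteUri, String localPath);
    void removeLocalizedResource(String remoteUri);
    Map<String, String> loadLocalizedResources();
}

// Used when recovery is disabled: every call is a no-op and nothing is loaded.
final class NullStateStore implements LocalResourceStateStore {
    public boolean canRecover() { return false; }
    public void storeLocalizedResource(String remoteUri, String localPath) { }
    public void removeLocalizedResource(String remoteUri) { }
    public Map<String, String> loadLocalizedResources() { return Collections.emptyMap(); }
}

// Used when recovery is enabled. A sorted map stands in for the leveldb
// key-value store here; a real store would write to disk.
final class MapBackedStateStore implements LocalResourceStateStore {
    private final Map<String, String> entries = new TreeMap<>();

    public boolean canRecover() { return true; }
    public void storeLocalizedResource(String remoteUri, String localPath) {
        entries.put(remoteUri, localPath);
    }
    public void removeLocalizedResource(String remoteUri) {
        entries.remove(remoteUri);
    }
    public Map<String, String> loadLocalizedResources() {
        return new TreeMap<>(entries); // defensive copy
    }
}

// Callers pick the implementation once, based on configuration, and then use
// the interface without checking whether recovery is enabled.
final class StateStoreFactory {
    static LocalResourceStateStore create(boolean recoveryEnabled) {
        return recoveryEnabled ? new MapBackedStateStore() : new NullStateStore();
    }
}
```

The point of the null store is that the localization code can call the store unconditionally; whether
anything is recorded is decided in one place.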

Note that resource reference counts are not explicitly persisted.  Once containers are recovered
in YARN-1337, the recovered containers will re-request their resources, which will restore
the correct reference count state.  Even if we don't recover containers this is still useful,
since it allows an NM to remember which resources have been localized across an NM restart.
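
A minimal sketch of why the counts do not need to be stored, assuming a hypothetical tracker class
(RecoveredResourceTracker and its methods are illustrative names, not taken from the patch): resources
come back from the state store with no holders, and each request from a recovered container adds one
reference back.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical tracker; it models only the bookkeeping, not localization itself.
final class RecoveredResourceTracker {
    // resource key -> IDs of the containers currently referencing it
    private final Map<String, Set<String>> holders = new HashMap<>();

    // Called once per resource loaded from the state store after a restart.
    // The resource is known to be on disk but starts with zero references.
    void recoverResource(String resourceKey) {
        holders.putIfAbsent(resourceKey, new HashSet<>());
    }

    // Called when a recovered or new container requests a resource.
    // Returns true if the resource was already localized (no download needed).
    boolean request(String resourceKey, String containerId) {
        boolean alreadyLocalized = holders.containsKey(resourceKey);
        holders.computeIfAbsent(resourceKey, k -> new HashSet<>()).add(containerId);
        return alreadyLocalized;
    }

    // Current reference count, derived from the set of holders.
    int refCount(String resourceKey) {
        Set<String> ids = holders.get(resourceKey);
        return ids == null ? 0 : ids.size();
    }
}
```

In this model a recovered resource that no container re-requests simply stays at a count of zero,
which is the state it would need in order to be eligible for cache cleanup.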

One last thing that isn't persisted in the current patch is the resource reference timestamp
used for LRU sorting during a cache cleanup.  It's not needed for correctness but would be
nice to persist so we don't end up purging a recently used resource after an NM recovery.
We could add that in a followup JIRA or I could add it as part of this one.
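
To make the LRU concern concrete, here is a small sketch under assumed names (LruCleanupSketch is
hypothetical and much simpler than the NM's real cleanup logic). It picks the eviction candidate with
the oldest reference timestamp. If every recovered resource is given the same timestamp at restart,
the choice no longer reflects actual usage.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of LRU victim selection during cache cleanup.
final class LruCleanupSketch {
    // resource key -> last reference time in milliseconds, in recovery order
    private final Map<String, Long> lastUsed = new LinkedHashMap<>();

    // Register a recovered resource with the timestamp it should be sorted by.
    void recover(String resourceKey, long lastUsedMs) {
        lastUsed.put(resourceKey, lastUsedMs);
    }

    // Returns the key with the oldest timestamp, or null if the cache is empty.
    // On ties the resource recovered first is returned.
    String nextVictim() {
        String victim = null;
        long oldest = Long.MAX_VALUE;
        for (Map.Entry<String, Long> e : lastUsed.entrySet()) {
            if (victim == null || e.getValue() < oldest) {
                victim = e.getKey();
                oldest = e.getValue();
            }
        }
        return victim;
    }
}
```

With persisted timestamps, a resource last used long ago is purged before one used just before the
restart. With timestamps reset at recovery, all entries tie and in this sketch the victim is whichever
resource happened to be recovered first.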

> Recover localized resource cache state upon nodemanager restart
> ---------------------------------------------------------------
>
>                 Key: YARN-1338
>                 URL: https://issues.apache.org/jira/browse/YARN-1338
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>         Attachments: YARN-1338.patch
>
>
> Today when the node manager restarts, we clean up all the distributed cache files from disk.
> This is definitely not ideal in two respects.
> * For work-preserving restart we definitely want to keep them, as running containers are
> using them.
> * Even for non-work-preserving restart this is useful, in the sense that we don't have to
> download them again if they are needed by future tasks.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
