hadoop-yarn-issues mailing list archives

From "Omkar Vinit Joshi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1338) Recover localized resource cache state upon nodemanager restart
Date Tue, 12 Nov 2013 19:33:17 GMT

    [ https://issues.apache.org/jira/browse/YARN-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820378#comment-13820378 ]

Omkar Vinit Joshi commented on YARN-1338:

Thanks [~jlowe] 
bq. I would rather not tie a checksum to this. Corruption of the file isn't related to whether
the NM is restarting, and it seems odd to only check for corruption on restart rather than
every time the resource is requested. IMHO we should treat checksums for localized resources
as an orthogonal feature request to this. (It would also significantly slow down the recovery
time if the NM had to checksum-compare everything in the distcache on startup.)
Yes, I completely agree. Checksumming should be a separate feature rather than something done
as part of this.

bq. So if we persist the LocalResourceRequest to LocalizedResource map then we can tell after
a recovery whether we already have the requested resource or not when a new request arrives.
Agreed. This way we will have all the information we need to reconstruct the cache. 
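
To make the idea concrete, here is a rough sketch of how a recovered request-to-resource map could answer "do we already have this?" when a new request arrives. All names here (ResourceKey, LocalizedEntry, RecoveredCache) are hypothetical simplifications and not the actual YARN classes, and a plain HashMap stands in for whatever persistent store ends up being used.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical, simplified stand-in for a LocalResourceRequest key (not the YARN class). */
final class ResourceKey {
    final String uri;
    final long timestamp;

    ResourceKey(String uri, long timestamp) {
        this.uri = uri;
        this.timestamp = timestamp;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ResourceKey)) {
            return false;
        }
        ResourceKey other = (ResourceKey) o;
        return uri.equals(other.uri) && timestamp == other.timestamp;
    }

    @Override
    public int hashCode() {
        return Objects.hash(uri, timestamp);
    }
}

/** Hypothetical record of where a resource landed on local disk. */
final class LocalizedEntry {
    final String localPath;
    final long sizeBytes;

    LocalizedEntry(String localPath, long sizeBytes) {
        this.localPath = localPath;
        this.sizeBytes = sizeBytes;
    }
}

/** Cache rebuilt after a restart from the persisted request-to-resource map. */
final class RecoveredCache {
    private final Map<ResourceKey, LocalizedEntry> entries = new HashMap<>();

    /** Loads everything that was persisted before the restart. */
    void recover(Map<ResourceKey, LocalizedEntry> persisted) {
        entries.putAll(persisted);
    }

    /** Returns the local path if the resource is already localized, or null if a download is needed. */
    String lookup(ResourceKey request) {
        LocalizedEntry entry = entries.get(request);
        return entry == null ? null : entry.localPath;
    }
}
```

A request with the same URI but a different timestamp misses the cache in this sketch, so a changed remote file would be downloaded again.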

bq. We have a very rough start on persisting the local cache state, and I plan on working
on this in earnest in the next few weeks.
good ... 

Any thoughts on how and when we plan to persist the container's resource requests and
newly downloaded resources to the persistent store?
* For a resource request it is straightforward: when the download finishes and the resource
is marked LOCALIZED, we should save the info (the way RM restart does today for RMAppImpl: NEW -> NEW_SAVING -> SUBMITTED).
* For a container request it becomes a little tricky. Should we save it:
** When we initially get resource request for all the required resources during container
** or when individual resource request gets satisfied (as they are added to ref of LocalizedResource)
** or when all of the container's resources are downloaded / localized?
The 3rd scenario looks good to me because
* by then we will have information about all the localized resources. If downloading failed
for any of them, we don't care about storing the partial success, so we can skip the write.
* Also, when the container finishes or fails, we can simply remove the entry.
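
A minimal sketch of the 3rd scenario, assuming a hypothetical ContainerResourceTracker (not an existing NM class) with an in-memory map standing in for the persistent store: the container entry is written once, when its last resource is localized, and removed when the container finishes or fails.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical tracker: persists a container's resources only once all of them are localized. */
final class ContainerResourceTracker {
    // Stand-in for the persistent store: containerId -> localized resources.
    private final Map<String, Set<String>> store = new HashMap<>();
    // Resources each container is still waiting for.
    private final Map<String, Set<String>> pending = new HashMap<>();
    // Resources already localized for each container, not yet persisted.
    private final Map<String, Set<String>> done = new HashMap<>();

    void containerRequested(String containerId, Set<String> resources) {
        pending.put(containerId, new HashSet<>(resources));
        done.put(containerId, new HashSet<>());
    }

    /** Returns true if this call localized the container's last resource and the entry was saved. */
    boolean resourceLocalized(String containerId, String resource) {
        Set<String> waiting = pending.get(containerId);
        if (waiting == null || !waiting.remove(resource)) {
            return false; // unknown container or resource, nothing to do
        }
        done.get(containerId).add(resource);
        if (waiting.isEmpty()) {
            store.put(containerId, done.get(containerId)); // single write, full success only
            return true;
        }
        return false;
    }

    /** A failed download drops the in-memory state; partial success is never persisted. */
    void resourceFailed(String containerId) {
        pending.remove(containerId);
        done.remove(containerId);
    }

    /** Called when the container finishes or fails: simply remove the entry. */
    void containerFinished(String containerId) {
        pending.remove(containerId);
        done.remove(containerId);
        store.remove(containerId);
    }

    boolean isPersisted(String containerId) {
        return store.containsKey(containerId);
    }
}
```

With this shape there is one store write per container on success and none on a failed localization.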
Any thoughts on whether we should hold off container start until all the writes to the store
have been processed, or can we start in parallel? Parallel writes don't look good to me: if
any write events are in flight when the NM restarts, then after the restart we won't know about
those changes. But at the same time, if we wait for all the writes to go through, we are
delaying container start by that duration.
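
For illustration only, the "wait for the writes" option could look roughly like the sketch below. WriteThenLaunch and its methods are hypothetical names, and a single-threaded executor with an in-memory log stands in for the real store writer; the point is just that the launch blocks until every pending write has completed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical sketch: a container launch is gated on completion of its store writes. */
final class WriteThenLaunch {
    // Ordered record of what happened, standing in for the store plus the launcher.
    final List<String> log = Collections.synchronizedList(new ArrayList<>());
    // Single writer thread, so writes are applied in submission order.
    private final ExecutorService storeWriter = Executors.newSingleThreadExecutor();

    /** Queues a write to the (simulated) store and returns a future that completes when it is done. */
    CompletableFuture<Void> persistAsync(String record) {
        return CompletableFuture.runAsync(() -> log.add("stored " + record), storeWriter);
    }

    /** Blocks until every pending write has gone through, then launches the container. */
    void launch(String containerId, List<CompletableFuture<Void>> writes) {
        CompletableFuture.allOf(writes.toArray(new CompletableFuture<?>[0])).join();
        log.add("launched " + containerId);
    }

    void shutdown() {
        storeWriter.shutdown();
    }
}
```

The cost is visible in launch(): the join() adds the store latency to container start, which is exactly the trade-off in question.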

> Recover localized resource cache state upon nodemanager restart
> ---------------------------------------------------------------
>                 Key: YARN-1338
>                 URL: https://issues.apache.org/jira/browse/YARN-1338
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
> Today when node manager restarts we clean up all the distributed cache files from disk. This is definitely not ideal from 2 aspects.
> * For work preserving restart we definitely want them as running containers are using
> * For even non work preserving restart this will be useful in the sense that we don't have to download them again if needed by future tasks.
