hadoop-yarn-issues mailing list archives

From "Zhijie Shen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2249) AM release request may be lost on RM restart
Date Mon, 18 Aug 2014 20:45:21 GMT

    [ https://issues.apache.org/jira/browse/YARN-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14101250#comment-14101250 ]

Zhijie Shen commented on YARN-2249:

1. Should the following be done in AbstractYarnScheduler.serviceInit?
+    super.nmExpireInterval =
+        conf.getInt(YarnConfiguration.RM_NM_EXPIRY_INTERVAL_MS,
+          YarnConfiguration.DEFAULT_RM_NM_EXPIRY_INTERVAL_MS);
+    createReleaseCache();

2. Add RM_NM_EXPIRY_INTERVAL_MS in yarn-default.xml?

3. I'm not sure this is an efficient data structure. Different apps' containers should
not affect each other, right? A single "mutex" on the whole collection seems too coarse-grained
(it blocks the allocate call). Should we use Map<AppAttemptId, List<ContainerId>> and
give each app its own mutex?
+  private Set<ContainerId> pendingRelease = null;
+  private final Object mutex = new Object();
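
To make the suggestion in point 3 concrete, here is a minimal sketch of a pending-release cache keyed by app attempt, so that one app's bookkeeping does not block another's. It is not taken from any YARN-2249 patch: the class name PendingReleaseCache and its methods are hypothetical, plain String ids stand in for ApplicationAttemptId and ContainerId so the example is self-contained, and a Set is used per attempt rather than the List mentioned above.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration only; not the scheduler's real class.
// String ids stand in for ApplicationAttemptId and ContainerId.
class PendingReleaseCache {
  // One set of pending releases per app attempt. Each set doubles as
  // that attempt's lock, so attempts never contend with each other.
  private final Map<String, Set<String>> pendingByAttempt =
      new ConcurrentHashMap<>();

  // Record a release request for a container that is not yet recovered.
  void addPending(String attemptId, String containerId) {
    Set<String> pending =
        pendingByAttempt.computeIfAbsent(attemptId, k -> new HashSet<>());
    synchronized (pending) {
      pending.add(containerId);
    }
  }

  // Called when a container is recovered. Returns true if a release
  // request was waiting for it, in which case the caller should release it.
  boolean removeIfPending(String attemptId, String containerId) {
    Set<String> pending = pendingByAttempt.get(attemptId);
    if (pending == null) {
      return false;
    }
    synchronized (pending) {
      return pending.remove(containerId);
    }
  }

  // Drop everything for an attempt, e.g. when it finishes or the
  // cached requests are considered expired.
  void clearAttempt(String attemptId) {
    pendingByAttempt.remove(attemptId);
  }

  // Number of release requests still waiting for this attempt.
  int pendingCount(String attemptId) {
    Set<String> pending = pendingByAttempt.get(attemptId);
    if (pending == null) {
      return 0;
    }
    synchronized (pending) {
      return pending.size();
    }
  }

  public static void main(String[] args) {
    PendingReleaseCache cache = new PendingReleaseCache();
    cache.addPending("attempt_1", "container_1");
    // The container is recovered later; the cached request is found.
    System.out.println(cache.removeIfPending("attempt_1", "container_1"));
    // A different attempt is unaffected by attempt_1's entries.
    System.out.println(cache.removeIfPending("attempt_2", "container_1"));
  }
}
```

The trade-off against the single mutex is a little extra memory per attempt and the need to clean up entries when an attempt goes away, in exchange for allocate calls from different apps no longer serializing on one lock.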

> AM release request may be lost on RM restart
> --------------------------------------------
>                 Key: YARN-2249
>                 URL: https://issues.apache.org/jira/browse/YARN-2249
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Jian He
>            Assignee: Jian He
>         Attachments: YARN-2249.1.patch, YARN-2249.1.patch, YARN-2249.2.patch, YARN-2249.2.patch,
YARN-2249.3.patch, YARN-2249.4.patch
> AM resync on RM restart will send outstanding container release requests back to the
new RM. In the meantime, NMs report the container statuses back to RM to recover the containers.
If RM receives the container release request before the container is actually recovered in
scheduler, the container won't be released and the release request will be lost.

This message was sent by Atlassian JIRA
