hadoop-yarn-issues mailing list archives

From "Junping Du (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1337) Recover containers upon nodemanager restart
Date Fri, 08 Aug 2014 03:28:12 GMT

    [ https://issues.apache.org/jira/browse/YARN-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090254#comment-14090254
] 

Junping Du commented on YARN-1337:
----------------------------------

Thanks [~jlowe] for contributing a patch! I have some initial comments; more may come
later.
In ContainerExecutor.java,
{code}
+    if (pidPath == null) {
+      LOG.info(containerId + " is not active, returning terminated error");
+      return ExitCode.TERMINATED.getExitCode();
+    }
{code}
Maybe LOG.warn is a better option here?

{code}
+    while (!file.exists() && msecLeft > 0) {
+      if (!isContainerActive(containerId)) {
+        LOG.info(containerId + " was deactivated");
+        return ExitCode.TERMINATED.getExitCode();
+      }
+      try {
+        Thread.sleep(sleepMsec);
+      } catch (InterruptedException e) {
+        throw new IOException(
+            "Interrupted while waiting for exit code from " + containerId, e);
+      }
+      msecLeft -= sleepMsec;
+    }
+    if (msecLeft < 0) {
+      throw new IOException("Timeout while waiting for exit code from "
+          + containerId);
+    }
{code}
What about msecLeft == 0? The while loop exits in that case but no exception is thrown; the
check would be better as msecLeft <= 0.
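The boundary case can be reproduced in isolation. Below is a minimal sketch, not the patch's code: the class name, the method name, and the BooleanSupplier standing in for file.exists() are all hypothetical. It assumes one way to avoid the off-by-one, which is to re-check the awaited condition after the loop rather than inspecting the counter.

```java
import java.io.IOException;
import java.util.function.BooleanSupplier;

// Hypothetical stand-alone sketch; names are illustrative, not from YARN-1337.
class ExitCodeWaitSketch {

  /**
   * Polls until exitFileExists reports true or the time budget runs out.
   * Returns normally if the file appeared, throws IOException otherwise.
   */
  static void waitForExitFile(BooleanSupplier exitFileExists,
      long timeoutMsec, long sleepMsec) throws IOException {
    long msecLeft = timeoutMsec;
    while (!exitFileExists.getAsBoolean() && msecLeft > 0) {
      try {
        Thread.sleep(sleepMsec);
      } catch (InterruptedException e) {
        // Restore the interrupt flag before converting to IOException.
        Thread.currentThread().interrupt();
        throw new IOException("Interrupted while waiting for exit code", e);
      }
      msecLeft -= sleepMsec;
    }
    // Check the condition itself instead of the counter, so that a budget
    // that lands exactly on 0 is still reported as a timeout.
    if (!exitFileExists.getAsBoolean()) {
      throw new IOException("Timeout while waiting for exit code");
    }
  }
}
```

With a timeout of 10 ms and a sleep of 5 ms the counter reaches exactly 0, which is the case the original `msecLeft < 0` check lets through silently.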

{code}
+      // TODO: exit code script for Windows
{code}
Should we open a JIRA for this?

In NodeStatusUpdaterImpl.java,
{code}
+          try {
+            context.getNMStateStore().removeContainer(cid);
+          } catch (IOException e) {
+            LOG.error("Unable to remove container " + cid + " in store", e);
+          }
{code}
Again, what would happen if removing the container fails (or if other state store actions, e.g. store, fail)?


{code}
+    final boolean delayedRpcServerStart = (initialAddress.getPort() != 0);
{code}
We use whether the NM port is 0 to decide delayedRpcServerStart. Doesn't this seem a little
tricky? Maybe replace it with a new configuration?
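As a rough illustration of the explicit-configuration alternative, here is a minimal sketch. It uses java.util.Properties as a stand-in for Hadoop's Configuration so that it stays self-contained, and the property key and class name are hypothetical, not existing YARN settings.

```java
import java.util.Properties;

// Hypothetical sketch: an explicit flag instead of inferring the behavior
// from the configured port number.
class DelayedRpcStartSketch {

  // Assumed key for illustration only; this is NOT a real YARN property.
  static final String DELAYED_RPC_START_KEY =
      "example.nodemanager.recovery.delayed-rpc-start";

  /** Reads the flag, defaulting to false when it is not set. */
  static boolean delayedRpcServerStart(Properties conf) {
    return Boolean.parseBoolean(
        conf.getProperty(DELAYED_RPC_START_KEY, "false"));
  }
}
```

The trade-off is one more setting to document, in exchange for behavior that no longer depends on a side effect of the port choice.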

{code}
--- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/proto/yarn_server_nodemanager_recovery.proto
+++ hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/proto/yarn_server_nodemanager_recovery.proto
@@ -45,4 +45,3 @@ message LocalizedResourceProto {
   optional string localPath = 2;
   optional int64 size = 3;
 }
-
{code}
Unnecessary change?

In RMNodeImpl.java,
{code}
-        if (rmNode.getState() != NodeState.UNHEALTHY) {
-          // Only add new node if old state is not UNHEALTHY
-          rmNode.context.getDispatcher().getEventHandler().handle(
-              new NodeAddedSchedulerEvent(rmNode));
-        }
       } else {
+        // Kill containers since node is rejoining.
+        rmNode.nodeUpdateQueue.clear();
+        rmNode.context.getDispatcher().getEventHandler().handle(
+            new NodeRemovedSchedulerEvent(rmNode));
{code}
This could cause trouble if we allow the NM’s resources to change during NM restart (once
YARN-291 is done). Maybe we should just remove the container-killing code rather than move it
elsewhere?

> Recover containers upon nodemanager restart
> -------------------------------------------
>
>                 Key: YARN-1337
>                 URL: https://issues.apache.org/jira/browse/YARN-1337
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>         Attachments: YARN-1337-v1.patch
>
>
> To support work-preserving NM restart we need to recover the state of the containers
> when the nodemanager went down.  This includes informing the RM of containers that have exited
> in the interim and a strategy for dealing with the exit codes from those containers along
> with how to reacquire the active containers and determine their exit codes when they terminate.
> The state of finished containers also needs to be recovered.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
