hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Varun Vasudev (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
Date Tue, 02 Feb 2016 06:06:40 GMT

    [ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15127732#comment-15127732
] 

Varun Vasudev commented on YARN-3998:
-------------------------------------

My apologies for the slow response [~hex108]. Thank you for your patience. Comments on the
latest patch

1) The patch doesn't apply cleanly on trunk anymore

2) In ContainerLaunchContext.java, change "Retry strategy when container fails to run.”
to "Retry strategy when container exits with failure."

3) In yarn_protos.proto and ContainerRetryPolicy.java, rename ALWAYS_RETRY to RETRY_ON_ALL_ERRORS

4) In ApplicationMaster.java and Client.java, rename "retry_policy" to "container_retry_policy",
"error_codes" to "container_retry_error_codes", "max_retries" to "container_max_retries",
and "retry_interval" to "container_retry_interval"

5) In ContainerImpl.java, change 
{code}
+  LOG.info("Relaunch Container " + container.getContainerId()
+            + ". Remain retry attempts : " + container.remainingRetryAttempts
+            + ". Retry interval is "
+            + container.containerRetryContext.getRetryInterval() + "ms");
+        container.isRelaunch = true;
{code}
to
{code}
+  LOG.info("Relaunching container " + container.getContainerId()
+            + ". Remaining retry attempts(after relaunch) : " + container.remainingRetryAttempts
+            + ". Interval between retries is "
+            + container.containerRetryContext.getRetryInterval() + "ms");
+        container.isRelaunch = true;
{code}

6) In ContainerImpl.java, can we avoid creating a new thread if the interval is 0?
{code}
+        new Thread() {
+          @Override
+          public void run() {
+            try {
+              Thread.sleep(container.containerRetryContext.getRetryInterval());
+              container.sendLaunchEvent();
+            } catch (InterruptedException e) {
+              return;
+            }
+          }
+        }.start();
 {code}

7) In ContainerLaunch.java, change "If container is in relaunch" to "If container is being
relaunched"

8) In ContainerLaunch.java, rename container1 to targetContainer

9) In ContainerLaunch.java, in getContainerWorkDir, we should use dirsHandler.getLocalPathForWrite
instead of dirsHandler.getLocalPathForRead. As a result, 
{code}
+  public Path getLocalPathForRead(String pathStr) throws IOException {
+    return getPathToRead(pathStr, getLocalDirsForRead());
+  }
{code}
is not required

10) In ContainerLaunch.java, in getContainerLogDir, instead of matching stdout exactly, can
we use FileSystem.globStatus and match *out instead?

11) In the current version of the patch, the diagnosticInfo String is unbounded and there
is no indication as to which attempt the diagnosticInfo came from. Can you update the patch
to 
  - put some seperator(like "Diagnostic message from attempt:") in the diagnostic message

  - limit the size of the diagnostic message to some upper bound

> Add retry-times to let NM re-launch container when it fails to run
> ------------------------------------------------------------------
>
>                 Key: YARN-3998
>                 URL: https://issues.apache.org/jira/browse/YARN-3998
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-3998.01.patch, YARN-3998.02.patch, YARN-3998.03.patch, YARN-3998.04.patch,
YARN-3998.05.patch
>
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM launches containers,
it could specify the value. Then NM will re-launch the container 'retry-times' times when
it fails to run(e.g.exit code is not 0). 
> It will save a lot of time. It avoids container localization. RM does not need to re-schedule
the container. And local files in container's working directory will be left for re-use.(If
container have downloaded some big files, it does not need to re-download them when running
again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message