Date: Mon, 2 Oct 2017 20:02:01 +0000 (UTC)
From: "Eric Badger (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-7278) LinuxContainer in docker mode will be failed when nodemanager restart, because timeout for docker is too slow.

    [ https://issues.apache.org/jira/browse/YARN-7278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16188731#comment-16188731 ]

Eric Badger commented on YARN-7278:
-----------------------------------

The Affects Version is set to 2.7.1. Is this a bug in DockerContainerExecutor? DockerContainerExecutor was deprecated in 2.9 and removed in 3.0. If this is instead a problem with DockerLinuxContainerRuntime, then the Affects Version shouldn't be set to 2.7.1.

> LinuxContainer in docker mode will be failed when nodemanager restart, because timeout for docker is too slow.
> --------------------------------------------------------------------------------------------------------------
>
> Key: YARN-7278
> URL: https://issues.apache.org/jira/browse/YARN-7278
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.7.1
> Environment: CentOS
> Reporter: zhengchenyu
> Fix For: 2.9.0
>
> Original Estimate: 1m
> Remaining Estimate: 1m
>
> In our cluster, NodeManager recovery is turned on, and we use LinuxContainer in Docker mode.
> A container may fail when the NodeManager restarts. The exception is below:
> {code}
> [2017-09-29T15:47:14.433+08:00] [INFO] containermanager.monitor.ContainersMonitorImpl.run(ContainersMonitorImpl.java 472) [Container Monitor] : Memory usage of ProcessTree 120523 for container-id container_1506600355508_0023_01_000004: -1B of 10 GB physical memory used; -1B of 31 GB virtual memory used
> [2017-09-29T15:47:15.219+08:00] [ERROR] containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java 93) [ContainersLauncher #1] : Unable to recover container container_1506600355508_0023_01_000004
> java.io.IOException: Timeout while waiting for exit code from container_1506600355508_0023_01_000004
> [2017-09-29T15:47:15.220+08:00] [INFO] containermanager.container.ContainerImpl.handle(ContainerImpl.java 1142) [AsyncDispatcher event handler] : Container container_1506600355508_0023_01_000004 transitioned from RUNNING to EXITED_WITH_FAILURE
> [2017-09-29T15:47:15.221+08:00] [INFO] containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java 440) [AsyncDispatcher event handler] : Cleaning up container container_1506600355508_0023_01_000004
> {code}
> I guess the process had finished, but 2 seconds later (the variable is msecLeft) the *.pid.exitcode file still had not been created. I then changed the variable to 20000 ms, and the container recovered successfully when the NodeManager restarted.
> So I think the timeout is too short for a Docker container to complete its work.
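The timeout described above can be pictured as a polling loop over the exit-code file. The following is only a minimal sketch under stated assumptions, not the NodeManager's actual recovery code: the class name ExitCodeFileWaiter, the method waitForExitCode, and the 100 ms poll interval are hypothetical. Only the idea of a msecLeft budget, the *.pid.exitcode file, and the error message text come from the report.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, for illustration only; not a Hadoop class.
final class ExitCodeFileWaiter {
    // Assumed poll interval; the report does not state the real one.
    static final long POLL_INTERVAL_MS = 100;

    /**
     * Polls for the exit-code file until it contains a value or the
     * time budget is used up, then throws an IOException.
     */
    static int waitForExitCode(Path exitCodeFile, long timeoutMs)
            throws IOException, InterruptedException {
        long msecLeft = timeoutMs;
        while (msecLeft >= 0) {
            if (Files.exists(exitCodeFile)) {
                String text = new String(
                        Files.readAllBytes(exitCodeFile),
                        StandardCharsets.UTF_8).trim();
                // An empty file may still be in the middle of being written.
                if (!text.isEmpty()) {
                    return Integer.parseInt(text);
                }
            }
            Thread.sleep(POLL_INTERVAL_MS);
            msecLeft -= POLL_INTERVAL_MS;
        }
        throw new IOException(
                "Timeout while waiting for exit code from " + exitCodeFile);
    }
}
```

With a budget of 2000 ms, any delay longer than about two seconds between the task exiting and the file being written ends in the IOException. Raising the budget to 20000 ms, as the reporter did, simply allows more polling rounds.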
> In Docker mode of LinuxContainer, the NM monitors the real task, which is launched by the "docker run" command. The "docker wait" command then waits for the exit code, and "docker rm" deletes the Docker container. Lastly, container-executor writes the exit code. So if any docker command is slow enough, the NM times out before the exit code is written and fails to recover the container. In fact, docker rm is always slow.
> I think the exit code of docker rm has nothing to do with the real task, so we could write "*.pid.exitcode" before running docker rm. Alternatively, the NM could monitor the docker wait process instead of the real task.
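The ordering argument in the description can be made concrete with a small timing model. This is an illustrative sketch only: the class ExitCodeOrdering and its methods are hypothetical, and any millisecond figures passed in are made-up inputs, not measurements from a cluster. The step order (docker wait, docker rm, write exit code) and the proposal to write the file before docker rm are taken from the report.

```java
// Hypothetical timing model, for illustration only; not Hadoop code.
final class ExitCodeOrdering {
    /**
     * Delay between the task exiting and the exit-code file appearing
     * with the order from the report:
     * docker wait -> docker rm -> write exit code.
     */
    static long delayCurrentOrder(long dockerWaitMs, long dockerRmMs, long writeMs) {
        return dockerWaitMs + dockerRmMs + writeMs;
    }

    /**
     * Delay with the proposed order:
     * docker wait -> write exit code -> docker rm.
     * docker rm no longer sits in front of the file write.
     */
    static long delayProposedOrder(long dockerWaitMs, long writeMs) {
        return dockerWaitMs + writeMs;
    }

    /** Recovery works only if the file appears within the NM's time budget. */
    static boolean recoverySucceeds(long delayMs, long timeoutMs) {
        return delayMs <= timeoutMs;
    }
}
```

Under this model, a slow docker rm counts against the 2000 ms budget only in the current order. Moving the file write ahead of docker rm removes that term without changing the timeout itself.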