Date: Mon, 18 Sep 2017 23:28:00 +0000 (UTC)
From: "James Peach (JIRA)"
To: issues@mesos.apache.org
Reply-To: dev@mesos.apache.org
Subject: [jira] [Commented] (MESOS-7963) Task groups can lose the container limitation status.

    [ https://issues.apache.org/jira/browse/MESOS-7963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16170890#comment-16170890 ]

James Peach commented on MESOS-7963:
------------------------------------

/cc [~jieyu] This covers the executor container limitation we discussed on the Slack channel.

> Task groups can lose the container limitation status.
> -----------------------------------------------------
>
>                 Key: MESOS-7963
>                 URL: https://issues.apache.org/jira/browse/MESOS-7963
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization, executor
>            Reporter: James Peach
>
> If you run a single task in a task group and that task fails with a container limitation, that status update can be lost and only the executor failure will be reported to the framework.
> {noformat}
> exec /opt/mesos/bin/mesos-execute --content_type=json --master=jpeach.apple.com:5050 '--task_group={
> "tasks":
>   [
>     {
>       "name": "7f141aca-55fe-4bb0-af4b-87f5ee26986a",
>       "task_id": {"value" : "2866368d-7279-4657-b8eb-bf1d968e8ebf"},
>       "agent_id": {"value" : ""},
>       "resources": [{
>           "name": "cpus",
>           "type": "SCALAR",
>           "scalar": {
>             "value": 0.2
>           }
>         }, {
>           "name": "mem",
>           "type": "SCALAR",
>           "scalar": {
>             "value": 32
>           }
>         }, {
>           "name": "disk",
>           "type": "SCALAR",
>           "scalar": {
>             "value": 2
>           }
>         }
>       ],
>       "command": {
>         "value": "sleep 2 ; /usr/bin/dd if=/dev/zero of=out.dat bs=1M count=64 ; sleep 10000"
>       }
>     }
>   ]
> }'
> I0911 11:48:01.480689  7340 scheduler.cpp:184] Version: 1.5.0
> I0911 11:48:01.488868  7339 scheduler.cpp:470] New master detected at master@17.228.224.108:5050
> Subscribed with ID aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010
> Submitted task group with tasks [ 2866368d-7279-4657-b8eb-bf1d968e8ebf ] to agent 'aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-S0'
> Received status update TASK_RUNNING for task '2866368d-7279-4657-b8eb-bf1d968e8ebf'
>   source: SOURCE_EXECUTOR
> Received status update TASK_FAILED for task '2866368d-7279-4657-b8eb-bf1d968e8ebf'
>   message: 'Command terminated with signal Killed'
>   source: SOURCE_EXECUTOR
> {noformat}
> However, the agent logs show that this failed with a memory limitation:
> {noformat}
> I0911 11:48:02.235818  7012 http.cpp:532] Processing call WAIT_NESTED_CONTAINER
> I0911 11:48:02.236395  7013 status_update_manager.cpp:323] Received status update TASK_RUNNING (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010
> I0911 11:48:02.237083  7016 slave.cpp:4875] Forwarding the update TASK_RUNNING (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010 to master@17.228.224.108:5050
> I0911 11:48:02.283661  7007 status_update_manager.cpp:395] Received status update acknowledgement (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010
> I0911 11:48:04.771455  7014 memory.cpp:516] OOM detected for container 474388fe-43c3-4372-b903-eaca22740996
> I0911 11:48:04.776445  7014 memory.cpp:556] Memory limit exceeded: Requested: 64MB Maximum Used: 64MB
> ...
> I0911 11:48:04.776943  7012 containerizer.cpp:2681] Container 474388fe-43c3-4372-b903-eaca22740996 has reached its limit for resource [{"name":"mem","scalar":{"value":64.0},"type":"SCALAR"}] and will be terminated
> {noformat}
> The following {{mesos-execute}} task will show the container limitation correctly:
> {noformat}
> exec /opt/mesos/bin/mesos-execute --content_type=json --master=jpeach.apple.com:5050 '--task_group={
> "tasks":
>   [
>     {
>       "name": "37db08f6-4f0f-4ef6-97ee-b10a5c5cc211",
>       "task_id": {"value" : "1372b2e2-c501-4e80-bcbd-1a5c5194e206"},
>       "agent_id": {"value" : ""},
>       "resources": [{
>           "name": "cpus",
>           "type": "SCALAR",
>           "scalar": {
>             "value": 0.2
>           }
>         },
>         {
>           "name": "mem",
>           "type": "SCALAR",
>           "scalar": {
>             "value": 32
>           }
>         }],
>       "command": {
>         "value": "sleep 600"
>       }
>     }, {
>       "name": "7247643c-5e4d-4b01-9839-e38db49f7f4d",
>       "task_id": {"value" : "a7571608-3a53-4971-a187-41ed8be183ba"},
>       "agent_id": {"value" : ""},
>       "resources": [{
>           "name": "cpus",
>           "type": "SCALAR",
>           "scalar": {
>             "value": 0.2
>           }
>         }, {
>           "name": "mem",
>           "type": "SCALAR",
>           "scalar": {
>             "value": 32
>           }
>         }, {
>           "name": "disk",
>           "type": "SCALAR",
>           "scalar": {
>             "value": 2
>           }
>         }
>       ],
>       "command": {
>         "value": "sleep 2 ; /usr/bin/dd if=/dev/zero of=out.dat bs=1M count=64 ; sleep 10000"
>       }
>     }
>   ]
> }'
> I0911 12:29:17.772161  7655 scheduler.cpp:184] Version: 1.5.0
> I0911 12:29:17.780640  7661 scheduler.cpp:470] New master detected at master@17.228.224.108:5050
> Subscribed with ID aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0011
> Submitted task group with tasks [ 1372b2e2-c501-4e80-bcbd-1a5c5194e206, a7571608-3a53-4971-a187-41ed8be183ba ] to agent 'aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-S0'
> Received status update TASK_RUNNING for task '1372b2e2-c501-4e80-bcbd-1a5c5194e206'
>   source: SOURCE_EXECUTOR
> Received status update TASK_RUNNING for task 'a7571608-3a53-4971-a187-41ed8be183ba'
>   source: SOURCE_EXECUTOR
> Received status update TASK_FAILED for task '1372b2e2-c501-4e80-bcbd-1a5c5194e206'
>   message: 'Command terminated with signal Killed'
>   source: SOURCE_EXECUTOR
> Received status update TASK_FAILED for task 'a7571608-3a53-4971-a187-41ed8be183ba'
>   message: 'Disk usage (65556KB) exceeds quota (34MB)'
>   source: SOURCE_AGENT
>   reason: REASON_CONTAINER_LIMITATION_DISK
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
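
For anyone trying to reproduce this, a small script can help spot the symptom in captured {{mesos-execute}} output. The following is only a minimal sketch: it assumes the output looks like the transcripts quoted above ("Received status update ..." lines followed by message/source/reason detail lines). The names `parse_updates` and `missing_limitation` are hypothetical helpers written for this illustration and are not part of Mesos.

```python
import re
from dataclasses import dataclass, field

# Matches the "Received status update ..." lines seen in the transcripts.
UPDATE_RE = re.compile(r"Received status update (\w+) for task '([^']+)'")
# Matches the detail lines (message/source/reason) that follow an update.
DETAIL_RE = re.compile(r"^\s*(message|source|reason):\s*(.*)$")
# Strips an optional "> " mail/JIRA quote prefix from a line.
QUOTE_RE = re.compile(r"^\s*>\s?")


@dataclass
class StatusUpdate:
    state: str
    task_id: str
    details: dict = field(default_factory=dict)


def parse_updates(text):
    """Parse mesos-execute style output into StatusUpdate records."""
    updates = []
    for raw in text.splitlines():
        line = QUOTE_RE.sub("", raw).rstrip()
        found = UPDATE_RE.search(line)
        if found:
            updates.append(StatusUpdate(found.group(1), found.group(2)))
            continue
        detail = DETAIL_RE.match(line)
        if detail and updates:
            # Attach the detail to the most recent update, dropping quotes.
            updates[-1].details[detail.group(1)] = detail.group(2).strip("'")
    return updates


def missing_limitation(updates):
    """Return TASK_FAILED updates that carry no container limitation reason."""
    return [
        u for u in updates
        if u.state == "TASK_FAILED"
        and not u.details.get("reason", "").startswith("REASON_CONTAINER_LIMITATION")
    ]


# Status updates copied from the first (single-task) transcript.
sample = """\
Received status update TASK_RUNNING for task '2866368d-7279-4657-b8eb-bf1d968e8ebf'
  source: SOURCE_EXECUTOR
Received status update TASK_FAILED for task '2866368d-7279-4657-b8eb-bf1d968e8ebf'
  message: 'Command terminated with signal Killed'
  source: SOURCE_EXECUTOR
"""

for update in missing_limitation(parse_updates(sample)):
    print(update.task_id, update.details.get("source"))
```

Note that this flags every TASK_FAILED update without a limitation reason, so in the two-task transcript it would also flag the first task, which failed with 'Command terminated with signal Killed'. Treat a hit as a hint to check the agent logs for an OOM or limit message, not as proof of this bug.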