Date: Sat, 20 Jan 2018 01:32:00 +0000 (UTC)
From: "Gilbert Song (JIRA)"
To: issues@mesos.apache.org
Subject: [jira] [Commented] (MESOS-8161) Potentially dangerous dangling mount when stopping task with persistent volume

[ https://issues.apache.org/jira/browse/MESOS-8161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333107#comment-16333107 ]

Gilbert Song commented on MESOS-8161:
-------------------------------------

[~zhitao], any update on this issue? Could we close it if it is not a Mesos issue? I am updating the priority to `Major` for now. Please change it back if necessary.
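[Editor's note] The failure in the quoted log below is umount(2) returning EBUSY ("Device or resource busy") while the executor is still running, which leaves the persistent volume mounted under the container sandbox. The sketch below is a minimal, hypothetical C++ illustration of how a cleanup path *might* classify umount failures; it is not Mesos code, the function name `unmountFailureAction` is invented, and lazy detach via `umount2(target, MNT_DETACH)` is mentioned only as one possible mitigation, not as what Mesos does.

{code:cpp}
#include <cerrno>
#include <cstring>
#include <iostream>
#include <string>

// Hypothetical decision logic for a containerizer cleanup path.
// The interesting case is EBUSY: something in the host mount
// namespace still holds the mount open, so a plain umount(2)
// fails and the mount dangles. If the agent's gc later runs an
// rm -rf over the sandbox, it can descend into that dangling
// mount and delete persistent-volume data.
std::string unmountFailureAction(int err)
{
  switch (err) {
    case EBUSY:
      // Retry, or detach lazily so the sandbox path is unlinked
      // from the mount immediately while the kernel tears it
      // down once the last user exits.
      return "retry, then umount2(target, MNT_DETACH)";
    case EINVAL:
      // Target is not a mount point; nothing is mounted there.
      return "not a mount point; nothing to do";
    default:
      return std::string("fail container cleanup: ") + std::strerror(err);
  }
}

int main()
{
  std::cout << unmountFailureAction(EBUSY) << std::endl;   // prints: retry, then umount2(target, MNT_DETACH)
  std::cout << unmountFailureAction(EINVAL) << std::endl;  // prints: not a mount point; nothing to do
  return 0;
}
{code}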
> Potentially dangerous dangling mount when stopping task with persistent volume
> ------------------------------------------------------------------------------
>
>                 Key: MESOS-8161
>                 URL: https://issues.apache.org/jira/browse/MESOS-8161
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Zhitao Li
>            Priority: Critical
>
> While we fixed a case in MESOS-7366 when an executor terminates, it seems like a very similar case can still happen if a task with a persistent volume terminates while the executor is still active, and [this unmount call|https://github.com/apache/mesos/blob/6f98b8d6d149c5497d16f588c683a68fccba4fc9/src/slave/containerizer/mesos/isolators/filesystem/linux.cpp#L489] fails due to "device busy".
> I believe that if the agent gc or anything else running in the host mount namespace touches the dangling mount, it is possible to lose persistent volume data because of this.
> Agent log:
> {code:none}
> I1101 20:19:44.137109 102240 slave.cpp:3961] Sending acknowledgement for status update TASK_RUNNING (UUID: ecdd32b8-8eba-40c5-92c8-3398310f142b) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 to executor(1)@10.70.142.140:36929
> I1101 20:19:44.235196 102233 status_update_manager.cpp:395] Received status update acknowledgement (UUID: ecdd32b8-8eba-40c5-92c8-3398310f142b) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:19:44.235302 102233 status_update_manager.cpp:832] Checkpointing ACK for status update TASK_RUNNING (UUID: ecdd32b8-8eba-40c5-92c8-3398310f142b) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:19:59.135591 102213 slave.cpp:3634] Handling status update TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 from executor(1)@10.70.142.140:36929
> I1101 20:19:59.136494 102216 status_update_manager.cpp:323] Received status update TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:19:59.136540 102216 status_update_manager.cpp:832] Checkpointing UPDATE for status update TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:19:59.136724 102234 slave.cpp:4051] Forwarding the update TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 to master@10.162.12.31:5050
> I1101 20:19:59.136867 102234 slave.cpp:3961] Sending acknowledgement for status update TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 to executor(1)@10.70.142.140:36929
> I1101 20:20:02.010108 102223 http.cpp:277] HTTP GET for /slave(1)/flags from 10.70.142.140:43046 with User-Agent='Python-urllib/2.7'
> I1101 20:20:02.038574 102238 http.cpp:277] HTTP GET for /slave(1)/flags from 10.70.142.140:43144 with User-Agent='Python-urllib/2.7'
> I1101 20:20:02.246388 102237 slave.cpp:5044] Current disk usage 0.23%. Max allowed age: 6.283560425078715days
> I1101 20:20:02.445312 102235 http.cpp:277] HTTP GET for /slave(1)/state.json from 10.70.142.140:44716 with User-Agent='Python-urllib/2.7'
> I1101 20:20:02.448276 102215 http.cpp:277] HTTP GET for /slave(1)/flags from 10.70.142.140:44732 with User-Agent='Python-urllib/2.7'
> I1101 20:20:07.789482 102231 http.cpp:277] HTTP GET for /slave(1)/state.json from 10.70.142.140:56414 with User-Agent='filebundle-agent'
> I1101 20:20:07.913359 102216 status_update_manager.cpp:395] Received status update acknowledgement (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:20:07.913455 102216 status_update_manager.cpp:832] Checkpointing ACK for status update TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:20:14.135632 102231 slave.cpp:3634] Handling status update TASK_ERROR (UUID: 913c25be-dfb6-4ad8-874f-d8e1c789ccc0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 from executor(1)@10.70.142.140:36929
> E1101 20:20:14.136687 102211 slave.cpp:6736] Unexpected terminal task state TASK_ERROR
> I1101 20:20:14.137081 102230 linux.cpp:627] Removing mount '/var/lib/mesos/slaves/db61f6d4-fd0f-48be-927d-14282c12301f-S193/frameworks/db61f6d4-fd0f-48be-927d-14282c12301f-0014/executors/node-1_executor__cbf96a3e-1b67-44c6-830d-ec1655d7ce05/runs/da3ecaac-35dc-4464-8204-76577dcde2a8/volume' for persistent volume disk(cassandra-test-varung5-framework, cassandra, {resource_id: 411a63af-0fea-4d2d-b850-a77039756e99})[a24515db-5b6b-4538-b1da-5c1acf0fe286:volume]:2000000 of container da3ecaac-35dc-4464-8204-76577dcde2a8
> I1101 20:20:14.137557 102241 disk.cpp:207] Updating the disk resources for container da3ecaac-35dc-4464-8204-76577dcde2a8 to cpus(cassandra-test-varung5-framework, cassandra, {resource_id: 58e9c94c-7512-45d2-bf7e-453ec42b55bf}):0.1; mem(cassandra-test-varung5-framework, cassandra, {resource_id: 57c52b95-5816-43cd-b8c4-8a46d4271788}):768; ports(cassandra-test-varung5-framework, cassandra, {resource_id: 6ed4bec0-b7bf-4513-8c65-197f6e3e1a44}):[31001-31001]
> I1101 20:20:14.137765 102241 disk.cpp:312] Checking disk usage at '/var/lib/mesos/slaves/db61f6d4-fd0f-48be-927d-14282c12301f-S193/frameworks/db61f6d4-fd0f-48be-927d-14282c12301f-0014/executors/node-1_executor__cbf96a3e-1b67-44c6-830d-ec1655d7ce05/runs/da3ecaac-35dc-4464-8204-76577dcde2a8/volume' for container da3ecaac-35dc-4464-8204-76577dcde2a8 has been cancelled
> I1101 20:20:14.137763 102227 memory.cpp:199] Updated 'memory.soft_limit_in_bytes' to 768MB for container da3ecaac-35dc-4464-8204-76577dcde2a8
> E1101 20:20:14.138339 102212 slave.cpp:3903] Failed to update resources for container da3ecaac-35dc-4464-8204-76577dcde2a8 of executor 'node-1_executor__cbf96a3e-1b67-44c6-830d-ec1655d7ce05' running task node-1__23fa9624-4608-404f-8d6f-02355595888f on status update for terminal task, destroying container: Collect failed: Failed to unmount unneeded persistent volume at '/var/lib/mesos/slaves/db61f6d4-fd0f-48be-927d-14282c12301f-S193/frameworks/db61f6d4-fd0f-48be-927d-14282c12301f-0014/executors/node-1_executor__cbf96a3e-1b67-44c6-830d-ec1655d7ce05/runs/da3ecaac-35dc-4464-8204-76577dcde2a8/volume': Failed to unmount '/var/lib/mesos/slaves/db61f6d4-fd0f-48be-927d-14282c12301f-S193/frameworks/db61f6d4-fd0f-48be-927d-14282c12301f-0014/executors/node-1_executor__cbf96a3e-1b67-44c6-830d-ec1655d7ce05/runs/da3ecaac-35dc-4464-8204-76577dcde2a8/volume': Device or resource busy
> I1101 20:20:14.138587 102220 status_update_manager.cpp:323] Received status update TASK_ERROR (UUID: 913c25be-dfb6-4ad8-874f-d8e1c789ccc0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:20:14.138608 102233 containerizer.cpp:1955] Destroying container da3ecaac-35dc-4464-8204-76577dcde2a8 in RUNNING state
> I1101 20:20:14.138664 102220 status_update_manager.cpp:832] Checkpointing UPDATE for status update TASK_ERROR (UUID: 913c25be-dfb6-4ad8-874f-d8e1c789ccc0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:20:14.138756 102232 linux_launcher.cpp:498] Asked to destroy container da3ecaac-35dc-4464-8204-76577dcde2a8
> I1101 20:20:14.138847 102214 slave.cpp:4051] Forwarding the update TASK_ERROR (UUID: 913c25be-dfb6-4ad8-874f-d8e1c789ccc0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 to master@10.162.12.31:5050
> I1101 20:20:14.138978 102214 slave.cpp:3961] Sending acknowledgement for status update TASK_ERROR (UUID: 913c25be-dfb6-4ad8-874f-d8e1c789ccc0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 to executor(1)@10.70.142.140:36929
> I1101 20:20:14.139230 102232 linux_launcher.cpp:541] Using freezer to destroy cgroup mesos/da3ecaac-35dc-4464-8204-76577dcde2a8
> I1101 20:20:14.139691 102227 cpu.cpp:101] Updated 'cpu.shares' to 102 (cpus 0.1) for container da3ecaac-35dc-4464-8204-76577dcde2a8
> I1101 20:20:14.140110 102235 cgroups.cpp:2705] Freezing cgroup /sys/fs/cgroup/freezer/mesos/da3ecaac-35dc-4464-8204-76577dcde2a8
> I1101 20:20:14.141026 102227 cpu.cpp:121] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' to 10ms (cpus 0.1) for container da3ecaac-35dc-4464-8204-76577dcde2a8
> I1101 20:20:14.243934 102236 cgroups.cpp:1439] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/da3ecaac-35dc-4464-8204-76577dcde2a8 after 103.768064ms
> I1101 20:20:14.246510 102236 cgroups.cpp:2723] Thawing cgroup /sys/fs/cgroup/freezer/mesos/da3ecaac-35dc-4464-8204-76577dcde2a8
> I1101 20:20:14.249119 102218 cgroups.cpp:1468] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/da3ecaac-35dc-4464-8204-76577dcde2a8 after 2.564096ms
> I1101 20:20:14.945940 102213 slave.cpp:4179] Got exited event for executor(1)@10.70.142.140:36929
> I1101 20:20:18.942217 102236 http.cpp:277] HTTP GET for /slave(1)/state from 10.70.142.140:52050 with User-Agent='python-requests/2.4.3 CPython/2.7.10 Linux/4.4.82'
> I1101 20:20:20.412132 102235 status_update_manager.cpp:395] Received status update acknowledgement (UUID: 913c25be-dfb6-4ad8-874f-d8e1c789ccc0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:20:20.412247 102235 status_update_manager.cpp:832] Checkpointing ACK for status update TASK_ERROR (UUID: 913c25be-dfb6-4ad8-874f-d8e1c789ccc0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> W1101 20:20:20.665359 102213 slave.cpp:2669] Ignoring updating pid for framework db61f6d4-fd0f-48be-927d-14282c12301f-0005 because it does not exist
> I1101 20:20:28.572569 102237 http.cpp:277] HTTP GET for /slave(1)/state.json from 10.70.142.140:48290 with User-Agent='Go-http-client/1.1'
> W1101 20:20:28.573552 102215 containerizer.cpp:1876] Skipping status for container da3ecaac-35dc-4464-8204-76577dcde2a8 because: Container does not exist
> I1101 20:20:28.619093 102233 http.cpp:277] HTTP GET for /slave(1)/state.json from 10.70.142.140:48292 with User-Agent='Go-http-client/1.1'
> I1101 20:20:37.789496 102218 http.cpp:277] HTTP GET for /slave(1)/state.json from 10.70.142.140:58214 with User-Agent='filebundle-agent'
> I1101 20:21:01.325791 102215 http.cpp:277] HTTP GET for /slave(1)/flags from 10.70.142.140:55670 with User-Agent='Python-urllib/2.7'
> I1101 20:21:01.333392 102225 http.cpp:277] HTTP GET for /slave(1)/state.json from 10.70.142.140:55691 with User-Agent='Python-urllib/2.7'
> I1101 20:21:01.335985 102215 http.cpp:277] HTTP GET for /slave(1)/flags from 10.70.142.140:55704 with User-Agent='Python-urllib/2.7'
> I1101 20:21:01.428691 102241 http.cpp:277] HTTP GET for /slave(1)/flags from 10.70.142.140:55980 with User-Agent='Python-urllib/2.7'
> I1101 20:21:02.247719 102214 slave.cpp:5044] Current disk usage 0.23%. Max allowed age: 6.283563174420139days
> I1101 20:21:07.789693 102233 http.cpp:277] HTTP GET for /slave(1)/state.json from 10.70.142.140:44932 with User-Agent='filebundle-agent'
> I1101 20:21:09.948168 102236 http.cpp:277] HTTP GET for /slave(1)/state from 10.70.142.140:51804 with User-Agent='python-requests/2.4.3 CPython/2.7.10 Linux/4.4.82'
> E1101 20:21:14.140836 102231 slave.cpp:4520] Termination of executor 'node-1_executor__cbf96a3e-1b67-44c6-830d-ec1655d7ce05' of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 failed: Failed to kill all processes in the container: Timed out after 1mins
> I1101 20:21:14.140986 102231 slave.cpp:4646] Cleaning up executor 'node-1_executor__cbf96a3e-1b67-44c6-830d-ec1655d7ce05' of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 at executor(1)@10.70.142.140:36929
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)