Date: Sat, 20 Jan 2018 01:32:00 +0000 (UTC)
From: "Gilbert Song (JIRA)"
To: issues@mesos.apache.org
Subject: [jira] [Commented] (MESOS-8161) Potentially dangerous dangling mount when stopping task with persistent volume

[ https://issues.apache.org/jira/browse/MESOS-8161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333107#comment-16333107 ]

Gilbert Song commented on MESOS-8161:
-------------------------------------

[~zhitao], any update on this issue? Could we close it if it is not a Mesos issue? I am updating the priority to `Major` for now. Please change it back if necessary.
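[Editor's note] The failure in the quoted log below is umount(2) returning EBUSY ("Device or resource busy") while the executor is still running, which leaves the persistent volume mounted under the container sandbox. The sketch below is a minimal, hypothetical C++ illustration of how a cleanup path *might* classify umount failures; it is not Mesos code, the function name `unmountFailureAction` is invented, and lazy detach via `umount2(target, MNT_DETACH)` is mentioned only as one possible mitigation, not as what Mesos does.

{code:cpp}
#include <cerrno>
#include <cstring>
#include <iostream>
#include <string>

// Hypothetical decision logic for a containerizer cleanup path.
// The interesting case is EBUSY: something in the host mount
// namespace still holds the mount open, so a plain umount(2)
// fails and the mount dangles. If the agent's gc later runs an
// rm -rf over the sandbox, it can descend into that dangling
// mount and delete persistent-volume data.
std::string unmountFailureAction(int err)
{
  switch (err) {
    case EBUSY:
      // Retry, or detach lazily so the sandbox path is unlinked
      // from the mount immediately while the kernel tears it
      // down once the last user exits.
      return "retry, then umount2(target, MNT_DETACH)";
    case EINVAL:
      // Target is not a mount point; nothing is mounted there.
      return "not a mount point; nothing to do";
    default:
      return std::string("fail container cleanup: ") + std::strerror(err);
  }
}

int main()
{
  std::cout << unmountFailureAction(EBUSY) << std::endl;   // prints: retry, then umount2(target, MNT_DETACH)
  std::cout << unmountFailureAction(EINVAL) << std::endl;  // prints: not a mount point; nothing to do
  return 0;
}
{code}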
> Potentially dangerous dangling mount when stopping task with persistent volume
> ------------------------------------------------------------------------------
>
>                 Key: MESOS-8161
>                 URL: https://issues.apache.org/jira/browse/MESOS-8161
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Zhitao Li
>            Priority: Critical
>
> While we fixed a case in MESOS-7366 when an executor terminates, it seems like a very similar case can still happen if a task with a persistent volume terminates while the executor is still active, and [this unmount call|https://github.com/apache/mesos/blob/6f98b8d6d149c5497d16f588c683a68fccba4fc9/src/slave/containerizer/mesos/isolators/filesystem/linux.cpp#L489] fails due to "device busy".
> I believe that if the agent gc or anything else running in the host mount namespace touches the dangling mount, it is possible to lose persistent volume data because of this.
> Agent log:
> {code:none}
> I1101 20:19:44.137109 102240 slave.cpp:3961] Sending acknowledgement for status update TASK_RUNNING (UUID: ecdd32b8-8eba-40c5-92c8-3398310f142b) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 to executor(1)@10.70.142.140:36929
> I1101 20:19:44.235196 102233 status_update_manager.cpp:395] Received status update acknowledgement (UUID: ecdd32b8-8eba-40c5-92c8-3398310f142b) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:19:44.235302 102233 status_update_manager.cpp:832] Checkpointing ACK for status update TASK_RUNNING (UUID: ecdd32b8-8eba-40c5-92c8-3398310f142b) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:19:59.135591 102213 slave.cpp:3634] Handling status update TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 from executor(1)@10.70.142.140:36929
> I1101 20:19:59.136494 102216 status_update_manager.cpp:323] Received status update TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:19:59.136540 102216 status_update_manager.cpp:832] Checkpointing UPDATE for status update TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:19:59.136724 102234 slave.cpp:4051] Forwarding the update TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 to master@10.162.12.31:5050
> I1101 20:19:59.136867 102234 slave.cpp:3961] Sending acknowledgement for status update TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 to executor(1)@10.70.142.140:36929
> I1101 20:20:02.010108 102223 http.cpp:277] HTTP GET for /slave(1)/flags from 10.70.142.140:43046 with User-Agent='Python-urllib/2.7'
> I1101 20:20:02.038574 102238 http.cpp:277] HTTP GET for /slave(1)/flags from 10.70.142.140:43144 with User-Agent='Python-urllib/2.7'
> I1101 20:20:02.246388 102237 slave.cpp:5044] Current disk usage 0.23%. Max allowed age: 6.283560425078715days
> I1101 20:20:02.445312 102235 http.cpp:277] HTTP GET for /slave(1)/state.json from 10.70.142.140:44716 with User-Agent='Python-urllib/2.7'
> I1101 20:20:02.448276 102215 http.cpp:277] HTTP GET for /slave(1)/flags from 10.70.142.140:44732 with User-Agent='Python-urllib/2.7'
> I1101 20:20:07.789482 102231 http.cpp:277] HTTP GET for /slave(1)/state.json from 10.70.142.140:56414 with User-Agent='filebundle-agent'
> I1101 20:20:07.913359 102216 status_update_manager.cpp:395] Received status update acknowledgement (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:20:07.913455 102216 status_update_manager.cpp:832] Checkpointing ACK for status update TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:20:14.135632 102231 slave.cpp:3634] Handling status update TASK_ERROR (UUID: 913c25be-dfb6-4ad8-874f-d8e1c789ccc0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 from executor(1)@10.70.142.140:36929
> E1101 20:20:14.136687 102211 slave.cpp:6736] Unexpected terminal task state TASK_ERROR
> I1101 20:20:14.137081 102230 linux.cpp:627] Removing mount '/var/lib/mesos/slaves/db61f6d4-fd0f-48be-927d-14282c12301f-S193/frameworks/db61f6d4-fd0f-48be-927d-14282c12301f-0014/executors/node-1_executor__cbf96a3e-1b67-44c6-830d-ec1655d7ce05/runs/da3ecaac-35dc-4464-8204-76577dcde2a8/volume' for persistent volume disk(cassandra-test-varung5-framework, cassandra, {resource_id: 411a63af-0fea-4d2d-b850-a77039756e99})[a24515db-5b6b-4538-b1da-5c1acf0fe286:volume]:2000000 of container da3ecaac-35dc-4464-8204-76577dcde2a8
> I1101 20:20:14.137557 102241 disk.cpp:207] Updating the disk resources for container da3ecaac-35dc-4464-8204-76577dcde2a8 to cpus(cassandra-test-varung5-framework, cassandra, {resource_id: 58e9c94c-7512-45d2-bf7e-453ec42b55bf}):0.1; mem(cassandra-test-varung5-framework, cassandra, {resource_id: 57c52b95-5816-43cd-b8c4-8a46d4271788}):768; ports(cassandra-test-varung5-framework, cassandra, {resource_id: 6ed4bec0-b7bf-4513-8c65-197f6e3e1a44}):[31001-31001]
> I1101 20:20:14.137765 102241 disk.cpp:312] Checking disk usage at '/var/lib/mesos/slaves/db61f6d4-fd0f-48be-927d-14282c12301f-S193/frameworks/db61f6d4-fd0f-48be-927d-14282c12301f-0014/executors/node-1_executor__cbf96a3e-1b67-44c6-830d-ec1655d7ce05/runs/da3ecaac-35dc-4464-8204-76577dcde2a8/volume' for container da3ecaac-35dc-4464-8204-76577dcde2a8 has been cancelled
> I1101 20:20:14.137763 102227 memory.cpp:199] Updated 'memory.soft_limit_in_bytes' to 768MB for container da3ecaac-35dc-4464-8204-76577dcde2a8
> E1101 20:20:14.138339 102212 slave.cpp:3903] Failed to update resources for container da3ecaac-35dc-4464-8204-76577dcde2a8 of executor 'node-1_executor__cbf96a3e-1b67-44c6-830d-ec1655d7ce05' running task node-1__23fa9624-4608-404f-8d6f-02355595888f on status update for terminal task, destroying container: Collect failed: Failed to unmount unneeded persistent volume at '/var/lib/mesos/slaves/db61f6d4-fd0f-48be-927d-14282c12301f-S193/frameworks/db61f6d4-fd0f-48be-927d-14282c12301f-0014/executors/node-1_executor__cbf96a3e-1b67-44c6-830d-ec1655d7ce05/runs/da3ecaac-35dc-4464-8204-76577dcde2a8/volume': Failed to unmount '/var/lib/mesos/slaves/db61f6d4-fd0f-48be-927d-14282c12301f-S193/frameworks/db61f6d4-fd0f-48be-927d-14282c12301f-0014/executors/node-1_executor__cbf96a3e-1b67-44c6-830d-ec1655d7ce05/runs/da3ecaac-35dc-4464-8204-76577dcde2a8/volume': Device or resource busy
> I1101 20:20:14.138587 102220 status_update_manager.cpp:323] Received status update TASK_ERROR (UUID: 913c25be-dfb6-4ad8-874f-d8e1c789ccc0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:20:14.138608 102233 containerizer.cpp:1955] Destroying container da3ecaac-35dc-4464-8204-76577dcde2a8 in RUNNING state
> I1101 20:20:14.138664 102220 status_update_manager.cpp:832] Checkpointing UPDATE for status update TASK_ERROR (UUID: 913c25be-dfb6-4ad8-874f-d8e1c789ccc0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:20:14.138756 102232 linux_launcher.cpp:498] Asked to destroy container da3ecaac-35dc-4464-8204-76577dcde2a8
> I1101 20:20:14.138847 102214 slave.cpp:4051] Forwarding the update TASK_ERROR (UUID: 913c25be-dfb6-4ad8-874f-d8e1c789ccc0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 to master@10.162.12.31:5050
> I1101 20:20:14.138978 102214 slave.cpp:3961] Sending acknowledgement for status update TASK_ERROR (UUID: 913c25be-dfb6-4ad8-874f-d8e1c789ccc0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 to executor(1)@10.70.142.140:36929
> I1101 20:20:14.139230 102232 linux_launcher.cpp:541] Using freezer to destroy cgroup mesos/da3ecaac-35dc-4464-8204-76577dcde2a8
> I1101 20:20:14.139691 102227 cpu.cpp:101] Updated 'cpu.shares' to 102 (cpus 0.1) for container da3ecaac-35dc-4464-8204-76577dcde2a8
> I1101 20:20:14.140110 102235 cgroups.cpp:2705] Freezing cgroup /sys/fs/cgroup/freezer/mesos/da3ecaac-35dc-4464-8204-76577dcde2a8
> I1101 20:20:14.141026 102227 cpu.cpp:121] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' to 10ms (cpus 0.1) for container da3ecaac-35dc-4464-8204-76577dcde2a8
> I1101 20:20:14.243934 102236 cgroups.cpp:1439] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/da3ecaac-35dc-4464-8204-76577dcde2a8 after 103.768064ms
> I1101 20:20:14.246510 102236 cgroups.cpp:2723] Thawing cgroup /sys/fs/cgroup/freezer/mesos/da3ecaac-35dc-4464-8204-76577dcde2a8
> I1101 20:20:14.249119 102218 cgroups.cpp:1468] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/da3ecaac-35dc-4464-8204-76577dcde2a8 after 2.564096ms
> I1101 20:20:14.945940 102213 slave.cpp:4179] Got exited event for executor(1)@10.70.142.140:36929
> I1101 20:20:18.942217 102236 http.cpp:277] HTTP GET for /slave(1)/state from 10.70.142.140:52050 with User-Agent='python-requests/2.4.3 CPython/2.7.10 Linux/4.4.82'
> I1101 20:20:20.412132 102235 status_update_manager.cpp:395] Received status update acknowledgement (UUID: 913c25be-dfb6-4ad8-874f-d8e1c789ccc0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:20:20.412247 102235 status_update_manager.cpp:832] Checkpointing ACK for status update TASK_ERROR (UUID: 913c25be-dfb6-4ad8-874f-d8e1c789ccc0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> W1101 20:20:20.665359 102213 slave.cpp:2669] Ignoring updating pid for framework db61f6d4-fd0f-48be-927d-14282c12301f-0005 because it does not exist
> I1101 20:20:28.572569 102237 http.cpp:277] HTTP GET for /slave(1)/state.json from 10.70.142.140:48290 with User-Agent='Go-http-client/1.1'
> W1101 20:20:28.573552 102215 containerizer.cpp:1876] Skipping status for container da3ecaac-35dc-4464-8204-76577dcde2a8 because: Container does not exist
> I1101 20:20:28.619093 102233 http.cpp:277] HTTP GET for /slave(1)/state.json from 10.70.142.140:48292 with User-Agent='Go-http-client/1.1'
> I1101 20:20:37.789496 102218 http.cpp:277] HTTP GET for /slave(1)/state.json from 10.70.142.140:58214 with User-Agent='filebundle-agent'
> I1101 20:21:01.325791 102215 http.cpp:277] HTTP GET for /slave(1)/flags from 10.70.142.140:55670 with User-Agent='Python-urllib/2.7'
> I1101 20:21:01.333392 102225 http.cpp:277] HTTP GET for /slave(1)/state.json from 10.70.142.140:55691 with User-Agent='Python-urllib/2.7'
> I1101 20:21:01.335985 102215 http.cpp:277] HTTP GET for /slave(1)/flags from 10.70.142.140:55704 with User-Agent='Python-urllib/2.7'
> I1101 20:21:01.428691 102241 http.cpp:277] HTTP GET for /slave(1)/flags from 10.70.142.140:55980 with User-Agent='Python-urllib/2.7'
> I1101 20:21:02.247719 102214 slave.cpp:5044] Current disk usage 0.23%. Max allowed age: 6.283563174420139days
> I1101 20:21:07.789693 102233 http.cpp:277] HTTP GET for /slave(1)/state.json from 10.70.142.140:44932 with User-Agent='filebundle-agent'
> I1101 20:21:09.948168 102236 http.cpp:277] HTTP GET for /slave(1)/state from 10.70.142.140:51804 with User-Agent='python-requests/2.4.3 CPython/2.7.10 Linux/4.4.82'
> E1101 20:21:14.140836 102231 slave.cpp:4520] Termination of executor 'node-1_executor__cbf96a3e-1b67-44c6-830d-ec1655d7ce05' of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 failed: Failed to kill all processes in the container: Timed out after 1mins
> I1101 20:21:14.140986 102231 slave.cpp:4646] Cleaning up executor 'node-1_executor__cbf96a3e-1b67-44c6-830d-ec1655d7ce05' of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 at executor(1)@10.70.142.140:36929
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)