Subject: Re: Review Request 69892: Made SLRP recover node-published volumes after reboot.
From: James DeFelice
To: Benjamin Bannier, Jie Yu, James DeFelice
Cc: Chun-Hung Hsiao, Mesos Reviewbot Windows, mesos
Date: Wed, 06 Feb 2019 02:01:39 -0000
Message-ID: <20190206020139.25681.48395@reviews-vm2.apache.org>
References: <20190205174134.25681.20825@reviews-vm2.apache.org>
In-Reply-To: <20190205174134.25681.20825@reviews-vm2.apache.org>
X-ReviewRequest-URL: https://reviews.apache.org/r/69892/
X-ReviewRequest-Repository: mesos
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="===============2115537444877588767=="

--===============2115537444877588767==
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit

> On Feb. 5, 2019, 5:41 p.m., Benjamin Bannier wrote:
> > src/csi/state.proto
> > Lines 62-67 (original), 62-77 (patched)
> >
> >     Any reason we cannot use a single field containing the `bootId` of
> >     the last transition? A single field would cut down on the number of
> >     possible message permutations, and also allow simpler handling
> >     (branching on a changed `boot_id`, triggering `state`-dependent
> >     handling). We could set such a `boot_id` whenever there is a state
> >     transition.
>
> Chun-Hung Hsiao wrote:
>     Consider the following scenario:
>     `CREATED` -> `NODE_READY` -> `VOL_READY` -> `PUBLISHED` -> reboot -> `VOL_READY` -> reboot
>     If we share the same `boot_id` for all transitions, we won't be able
>     to tell that this volume has been published before.
>     If we dedicate `boot_id` to `PUBLISHED`, we won't be able to tell
>     that there has been a reboot after the last `VOL_READY`, and hence
>     that we need to call `NodeStageVolume` again.
>
> Chun-Hung Hsiao wrote:
>     After an offline discussion, we decided to simplify the state
>     machine, and SLRP will try to bring a volume to `PUBLISHED` during
>     recovery as long as it has ever been in `VOL_READY` before.

This would mean that a misconfiguration that makes a plugin succeed on
`NodeStageVolume` but fail on `NodePublishVolume` will make the SLRP
unable to destroy the persistent volume, even if no data have ever been
written to it. How would an operator recover from such a situation?


- James


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69892/#review212557
-----------------------------------------------------------


On Feb. 5, 2019, 7:40 a.m., Chun-Hung Hsiao wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69892/
> -----------------------------------------------------------
>
> (Updated Feb. 5, 2019, 7:40 a.m.)
>
>
> Review request for mesos, Benjamin Bannier, James DeFelice, and Jie Yu.
>
>
> Bugs: MESOS-9544
>     https://issues.apache.org/jira/browse/MESOS-9544
>
>
> Repository: mesos
>
>
> Description
> -------
>
> If a CSI volume has been node-published before a reboot, SLRP will now
> try to bring it back to node-published again. This is important to
> perform synchronous persistent volume cleanup for `DESTROY`.
>
> To achieve this, in addition to keeping track of the boot ID when a CSI
> volume is node-staged in `VolumeState.vol_ready_boot_id` (formerly
> `VolumeState.boot_id`), SLRP now also keeps track of the boot ID when
> the volume is node-published. This helps SLRP to better determine if a
> volume has been published before reboot.
>
>
> Diffs
> -----
>
>   src/csi/state.proto 264a5657dd37605a6f3bdadd0e8d18ba9673191a
>   src/resource_provider/storage/provider.cpp d6e20a549ede189c757ae3ae922ab7cb86d2be2c
>   src/tests/storage_local_resource_provider_tests.cpp e8ed20f818ed7f1a3ce15758ea3c366520443377
>
>
> Diff: https://reviews.apache.org/r/69892/diff/1/
>
>
> Testing
> -------
>
> `make check`
>
> Testing for publish failures will be done later in chain.
>
>
> Thanks,
>
> Chun-Hung Hsiao
>
>

--===============2115537444877588767==--