From: ilya musayev
Date: Wed, 4 Apr 2018 15:06:43 -0700
Subject: Re: [DISCUSS] CloudStack graceful shutdown
To: dev@cloudstack.apache.org

Rafael,

> * Regarding the tasks/jobs that management servers (MSs) execute: do these
> tasks originate from requests that come to the MS, or is it possible for
> requests received by one management server to be executed by another? I
> mean, if I execute a request against MS1, will this request always be
> executed/treated by MS1, or is it possible that this request is executed
> by another MS (e.g. MS2)?

Yes, it's possible, but the job will be tracked in async_job under the MS
that is responsible for the task.

My initial goal was to prevent users from creating more async jobs on the
node that's about to go down for maintenance. Thinking about it more, I
don't know if that matters, since an async job is executed on the MS node
that tracks the specific hypervisor/agent, as defined in the cloud.host
table. Maybe I'll leave out the blocking of 8080/8443 and just focus on
tracking async_jobs instead.

Assuming you are managing your MSs behind a load balancer, it should be
smart enough to shift user traffic to an MS that is up.

> * I would suggest that after we block traffic coming from 8080/8443/8250
> (we will need to block this as well, right?), we can log the execution of
> tasks. I mean, something saying: there are XXX tasks (enumerate tasks)
> still being executed, we will wait for them to finish before shutting down.

Blocking 8250 is a bit too aggressive in my opinion, and we don't want to do
that. If you block 8250 while you are waiting on long-running tasks to
complete, they may fail, because you have blocked agent communication on
8250.

Thanks
ilya

On Wed, Apr 4, 2018 at 1:36 PM, Rafael Weingärtner
<rafaelweingartner@gmail.com> wrote:

> Big +1 for this feature; I only have a few doubts.
>
> * Regarding the tasks/jobs that management servers (MSs) execute: do these
> tasks originate from requests that come to the MS, or is it possible for
> requests received by one management server to be executed by another? I
> mean, if I execute a request against MS1, will this request always be
> executed/treated by MS1, or is it possible that this request is executed
> by another MS (e.g. MS2)?
>
> * I would suggest that after we block traffic coming from 8080/8443/8250
> (we will need to block this as well, right?), we can log the execution of
> tasks. I mean, something saying: there are XXX tasks (enumerate tasks)
> still being executed, we will wait for them to finish before shutting down.
>
> * The timeout (60 minutes suggested) could be a global setting that we can
> load before executing the graceful-shutdown.
>
> On Wed, Apr 4, 2018 at 5:15 PM, ilya musayev wrote:
>
> > Use case:
> > In any environment, from time to time, an administrator needs to perform
> > maintenance. The current stop sequence of the CloudStack management
> > server ignores the fact that there may be long-running async jobs and
> > terminates the process. This in turn can create a poor user experience
> > and occasional inconsistency in the CloudStack DB.
> >
> > This is especially painful in large environments where the user has
> > thousands of nodes and there is continuous patching around the clock
> > that requires migration of workload from one node to another.
> >
> > With that said, I've created a script that monitors the async job queue
> > for a given MS and waits for it to complete all jobs. More details are
> > posted below.
> >
> > I'd like to introduce "graceful-shutdown" into the systemctl/service of
> > the cloudstack-management service.
> >
> > The details of how it will work are below:
> >
> > Workflow for graceful shutdown:
> >   Using iptables/firewalld, block any connection attempts on 8080/8443
> >   (we can identify the ports dynamically)
> >   Identify the MSID for the node; using the proper MSID, query the
> >   async_job table for
> >     1) any jobs that are still running (job_status="0")
> >     2) job_dispatcher not like "pseudoJobDispatcher"
> >     3) job_init_msid=$my_ms_id
> >
> >   Monitor the async_job table for 60 minutes, until all async jobs for
> >   the MSID are done, then proceed with shutdown
> >   If failed for any reason or terminated, catch the exit via the trap
> >   command and unblock 8080/8443
> >
> > Comments are welcome
> >
> > Regards,
> > ilya
>
> --
> Rafael Weingärtner
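
For illustration, the wait-and-unblock logic of the workflow quoted above could be sketched in Python roughly as follows. This is only a sketch under stated assumptions, not the script mentioned in the thread. The function names, the callback parameters (`block_ports`, `unblock_ports`, `count_pending`, `stop_service`) and the 30-second poll interval are all hypothetical. The SQL text is built only from the table and column names given in the thread. Treating a timeout as a failure that unblocks the ports is an assumption as well.

```python
import time

# Filter taken from the workflow in the thread. The %s placeholder is the
# "format" paramstyle used by common MySQL drivers; adjust for your driver.
PENDING_JOBS_SQL = (
    "SELECT COUNT(*) FROM async_job "
    "WHERE job_status = 0 "
    "AND job_dispatcher NOT LIKE 'pseudoJobDispatcher' "
    "AND job_init_msid = %s"
)


def wait_for_async_jobs(count_pending, timeout_s=3600, poll_s=30,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll until count_pending() returns 0 or timeout_s elapses.

    count_pending is a hypothetical callable that runs PENDING_JOBS_SQL for
    this management server and returns the number of matching rows.
    Returns True if the queue drained, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if count_pending() == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)


def graceful_shutdown(block_ports, unblock_ports, count_pending,
                      stop_service, **wait_kwargs):
    """Block user ports, wait for jobs to drain, then stop the service.

    If the wait times out or raises (for example on Ctrl-C), the ports are
    unblocked again and the service is left running. A SIGTERM would need an
    explicit signal handler to get the same cleanup; that is not shown here.
    """
    block_ports()
    drained = False
    try:
        drained = wait_for_async_jobs(count_pending, **wait_kwargs)
    finally:
        if not drained:
            unblock_ports()
    if drained:
        stop_service()
    return drained
```

Passing the firewall and database operations in as callables keeps the sketch independent of iptables/firewalld and of any particular MySQL client, so the control flow can be exercised without touching a real management server.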