From: Chiradeep Vittal
To: "dev@cloudstack.apache.org", Tomasz Zięba
CC: Marcus Sorensen, Damoder Reddy, Alex Huang
Subject: Re: vms stopped while restarted by user
Date: Tue, 15 Jul 2014 19:12:45 +0000

Agree. Not sure why your system is so slow, but these parameters should help.

From: Daan Hoogland
Reply-To: "dev@cloudstack.apache.org"
Date: Tuesday, July 15, 2014 at 6:29 AM
To: Tomasz Zięba
Cc: "dev@cloudstack.apache.org", Marcus Sorensen, Damoder Reddy
Subject: vms stopped while restarted by user

Tomasz,

I can only guess at the full rationale behind the implementation of the
retry, but in general it makes sense to me. A job has a "time to try" and a
"times tried" field. The worker manager has a time to sleep and a maximum
number of retries. As you can see below, these are read from the
configuration:

    value = params.get("time.to.sleep");
    _timeToSleep = NumbersUtil.parseInt(value, 60) * 1000;
    value = params.get("max.retries");
    _maxRetries = NumbersUtil.parseInt(value, 5);

There is also

    value = params.get("stop.retry.interval");
    _stopRetryInterval = NumbersUtil.parseInt(value, 10 * 60);

The time.to.sleep and stop.retry.interval settings seem to jointly explain
the ten-minute scenario you described in the bug report. They do not explain
it completely, as some of the handling of the values is based on bit
shifting and not on date-time arithmetic (using mixed factors of 1000, 60,
60, 24 and 365.25). You can experiment with those to tune your setup. In any
case, looking at the VM to decide whether to restart it is not useful, as
CloudStack will do some cleanup after stopping the instance.
You should really wait until CloudStack reports on the job with either
success or error.

On Tue, Jul 15, 2014 at 3:12 PM, Tomasz Zięba wrote:

Hello,

The user does not receive confirmation of the operation. From the user's
point of view, it looks as if the machine stopped by itself.

As you can see in the logs, ACS explicitly sends the stop command, as if the
user had pressed the Stop button in the GUI, so the action is known from the
perspective of ACS / MS. I cannot point out which component may be
responsible for it.

We have tried to analyze the code to understand what is happening, but the
part of the code related to HAWorker is not very clear. Unfortunately, we
could not find anything online about the architecture or design assumptions
behind HAWorker.

Maybe taking small steps will help find a solution. First, a small question:
why does HAWorker reschedule? What was the idea behind this action?

2014-07-15 14:26 GMT+02:00 Daan Hoogland:

Tomasz,

As I understand the issue, this is what happens:

The user stops the VM from the UI.
The MS sends the stop command to the machine.
The machine stops, but takes a long time to do so.
The MS reschedules the stop.
Then the machine stops.
The user starts the machine.
The MS gets around to stopping the machine.

Did the user ever get a confirmation that the machine was stopped or that
stopping failed? If so, this is the bug, as it seems the MS works as
designed.

Don't get me wrong; I am trying to figure out a path to a solution for you.
I am not convinced there is a bug in the management server, though. That
doesn't mean there cannot be one in CloudStack overall, either at the design
level or, for instance, in some inter-process communication.

kind regards,
Daan Hoogland

On Fri, Jul 11, 2014 at 2:45 PM, Tomasz Zięba wrote:
> Hello,
>
> We are waiting longingly for the patch.
>
> The error that makes machines shut down by themselves causes very serious
> complications, both on the technical side (users need to wait for 10
> minutes and check whether the machine has been shut down automatically)
> and on the business side (the problem does not look very professional
> from the user's side).
>
> Given that:
> - the error was detected in February, so 5 months ago,
> - the error does not exist in earlier versions (3.0.2),
> - there is a procedure to reproduce the error,
>
> we would be very grateful if this issue were resolved in ACS 4.4.
>
>
> --
> Regards,
> Tomasz Zięba
> Twitter: @TZieba
> LinkedIn: pl.linkedin.com/pub/tomasz-zięba-ph-d/3b/7a8/ab6/

--
Daan
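
The configuration handling quoted in Daan's message can be sketched as a small standalone program. This is only an illustration under stated assumptions, not the CloudStack source: the class name RetryDefaults and its methods are hypothetical, and the local parseInt helper merely assumes that NumbersUtil.parseInt falls back to the given default when the value is missing or not a number. The parameter names and default values are the ones quoted in the thread.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; RetryDefaults is a hypothetical name, not a CloudStack class.
class RetryDefaults {

    // Assumed behaviour of NumbersUtil.parseInt: return the default when the
    // value is missing or is not a valid integer.
    static int parseInt(String value, int defaultValue) {
        if (value == null) {
            return defaultValue;
        }
        try {
            return Integer.parseInt(value.trim());
        } catch (NumberFormatException e) {
            return defaultValue;
        }
    }

    // time.to.sleep: default 60, multiplied by 1000 as in the quoted snippet.
    static long timeToSleep(Map<String, String> params) {
        return parseInt(params.get("time.to.sleep"), 60) * 1000L;
    }

    // max.retries: default 5.
    static int maxRetries(Map<String, String> params) {
        return parseInt(params.get("max.retries"), 5);
    }

    // stop.retry.interval: default written as 10 * 60 in the quoted snippet.
    static long stopRetryInterval(Map<String, String> params) {
        return parseInt(params.get("stop.retry.interval"), 10 * 60);
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>(); // nothing configured
        System.out.println("timeToSleep       = " + timeToSleep(params));
        System.out.println("maxRetries        = " + maxRetries(params));
        System.out.println("stopRetryInterval = " + stopRetryInterval(params));
    }
}
```

With nothing configured, the sketch yields 60000, 5 and 600. If the raw settings are in seconds, as the factor of 1000 and the 10 * 60 suggest, the stop retry default of 600 corresponds to the ten minutes mentioned in the bug report.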