From: Suresh Kumar Anaparti
Date: Fri, 25 Jan 2019 00:03:04 +0530
Subject: Re: Help! Jobs stuck in pending state
To: dev

Hi Alireza,

Table details below, to the best of my knowledge. @Dev, please correct me if any detail is wrong.

- sync_queue and sync_queue_item - used for handling the entity (VM, host, etc.) queues and concurrency control. Mainly, all the VM sync jobs pass through this queuing.
- async_job - holds all the async jobs and the related placeholder VM async jobs (if any).
- vm_work_job - extension to the placeholder VM async job in async_job; it holds the VM id and the job stage.
- op_ha_work - holds the VM work items to perform HA on the VMs, scheduled or cancelled based on the VM state.
- op_lock - used to acquire a lock on a record in the given table (key: + ) for a transaction by a running thread in the Management Server. The lock is released once the transaction is completed, and the corresponding record is deleted.

Hope this helps!

-Suresh

On Thu, Jan 24, 2019 at 12:49 AM Alireza Eskandari wrote:

> Dear Suresh and Andrei
> Thanks for your help.
> I have upgraded CloudStack from 4.9.3 to 4.11.2 but the problem still
> persists.
> Then I inspected the database tables and found that these three tables could be
> the root cause:
> - op_ha_work
> - op_lock
> - vm_work_job
> So I deleted all records in those tables and the problem was solved.
> The contents of those tables are submitted as a comment in the bug report in
> Jira:
> https://issues.apache.org/jira/browse/CLOUDSTACK-10401
> Suresh, could you tell me more about the role of those tables in CS?
> I think CS has become more sensitive about concurrent jobs. Previous versions
> worked better.
> Regards
>
> On Wed, Jan 23, 2019 at 9:43 PM Suresh Kumar Anaparti <
> sureshkumar.anaparti@gmail.com> wrote:
>
> > Hi Alireza,
> >
> > The *sync_queue* table is the actual VM sync queue, which holds a queue id for
> > each VM (*sync_objtype*: VmWorkJobQueue, *sync_objid*: ), and the VM
> > jobs reside in the *sync_queue_item* table against that queue id. Only
> > one running job is allowed per VM queue (*queue_size_limit*: 1 in the
> > *sync_queue* table). The active/running job has *queue_proc_id*,
> > *queue_proc_number* and *queue_proc_time* set in the *sync_queue_item*
> > table, and the remaining jobs with that queue id wait for the active job to
> > complete. So, to delete pending jobs, the records in the *sync_queue_item*
> > table have to be cleared for the respective VMs, not the *sync_queue* table.
> >
> > I think, in your case, the snapshot is taking a long time and the other jobs for
> > that VM have been pending for a long time because they are queued waiting for the
> > snapshot job to complete. What are the config values set for
> > "job.cancel.threshold.minutes", "job.expire.minutes" and
> > "volume.snapshot.job.cancel.threshold"? Are the jobs cancelled after the
> > threshold time?
> >
> > Thanks,
> > Suresh
> >
> > On Wed, Jan 23, 2019 at 7:14 PM Andrei Mikhailovsky
> > wrote:
> >
> > > Hi
> > >
> > > I've had this issue a few times in 2018 and managed to get it fixed pretty
> > > easily, although I had spent a number of hours initially trying to figure out
> > > WTF is going on. This issue looks like one of those artefacts that crept
> > > into one of the versions released in 2018 and hasn't been addressed by the
> > > dev team.
> > >
> > > The way I fixed it was similar to what has been recommended earlier.
> > > However, the difference was that I am sure I've looked at more tables than
> > > just the two suggested.
> > > Basically, I stopped the management server,
> > > created an SQL backup, connected to the SQL db and listed all tables.
> > > I grepped for words like job/schedule/queue/sync. After that I went
> > > through all the tables and pretty much removed all the past / active /
> > > awaiting-execution jobs. I started by looking at the jobs related to the VM
> > > (the VM that I had tried to start but wasn't able to). This worked once,
> > > but the second time I had to remove a lot more jobs, which related to other
> > > VMs. After that I started the management server and all went well from
> > > there.
> > >
> > > What I have also noticed is that my snapshot jobs (I use KVM and Ceph)
> > > seem to block jobs on the hypervisor hosts which are running these
> > > snapshots. So, if I try to perform various VM-related jobs on a host
> > > server which is currently running a snapshot process, that job will not be
> > > executed until the snapshot process is done. I've tested this countless
> > > times and it's still the case. Again, this issue appeared in one
> > > of the 2018 releases, as I never saw it between 2012 and 2017.
> > >
> > > Both issues are annoying as hell!
> > >
> > > Cheers
> > >
> > > ----- Original Message -----
> > > > From: "Alireza Eskandari"
> > > > To: "dev"
> > > > Sent: Wednesday, 23 January, 2019 12:40:48
> > > > Subject: Re: Help! Jobs stuck in pending state
> > >
> > > > I'm following this issue in github:
> > > > https://github.com/apache/cloudstack/issues/3104
> > > > Please leave your comments
> > > > Thanks
> > > >
> > > > On Wed, Jan 23, 2019 at 12:39 PM Wei ZHOU wrote:
> > > >
> > > >> Hi Alireza,
> > > >>
> > > >> could you try again after restarting mgt server ?
> > > >>
> > > >> -Wei
> > > >>
> > > >> On Wed, Jan 23, 2019 at 6:22 AM Alireza Eskandari wrote:
> > > >>
> > > >> > First I deleted two jobs which existed in the vm_work_job table and
> > > >> > their related entry in the sync_queue table, but it didn't help.
> > > >> > Then I deleted all the entries in the sync_queue table, and again no
> > > >> > success.
> > > >> > Any idea?
> > > >> >
> > > >> > On Wed, Jan 23, 2019 at 1:50 AM Wei ZHOU wrote:
> > > >> >
> > > >> > > If you know the instance id and mysql password, it should work after
> > > >> > > removing some records in mysql.
> > > >> > >
> > > >> > > ```
> > > >> > > set @id=XXXXX;
> > > >> > >
> > > >> > > delete from vm_work_job where vm_instance_id=@id;
> > > >> > > delete from sync_queue where sync_objid=@id;
> > > >> > > ```
> > > >> > >
> > > >> > > On Tue, Jan 22, 2019 at 10:59 PM Alireza Eskandari wrote:
> > > >> > >
> > > >> > > > Hi guys
> > > >> > > > I have opened a bug in jira about my problem in CS:
> > > >> > > > https://issues.apache.org/jira/browse/CLOUDSTACK-10401
> > > >> > > > CloudStack doesn't process jobs! My cloud is totally unusable.
> > > >> > > > Thanks in advance for your help.
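
The per-VM queue model described in the thread (one sync_queue row per VM, jobs in sync_queue_item, the running job marked by its queue_proc_* fields) can be illustrated with a small sketch. This is only a toy model under stated assumptions: it uses an in-memory SQLite database rather than CloudStack's MySQL database, the schema is a simplified stand-in, and the `id`, `queue_id` and `job` columns, the VM id 42 and the job names are hypothetical. It is not meant to be run against a real deployment.

```python
import sqlite3

# Simplified stand-in schema. Only sync_objtype, sync_objid, queue_size_limit,
# queue_proc_number and queue_proc_time are named in the thread; the `id`,
# `queue_id` and `job` columns are assumptions made for this illustration.
SCHEMA = """
CREATE TABLE sync_queue (
    id INTEGER PRIMARY KEY,
    sync_objtype TEXT NOT NULL,
    sync_objid INTEGER NOT NULL,
    queue_size_limit INTEGER NOT NULL DEFAULT 1
);
CREATE TABLE sync_queue_item (
    id INTEGER PRIMARY KEY,
    queue_id INTEGER NOT NULL REFERENCES sync_queue(id),
    job TEXT NOT NULL,
    queue_proc_number INTEGER,  -- NULL while the item is still waiting
    queue_proc_time TEXT
);
"""


def build_demo_db():
    """Create one VM queue (hypothetical VM id 42) with one running and two waiting jobs."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO sync_queue (id, sync_objtype, sync_objid) "
        "VALUES (1, 'VmWorkJobQueue', 42)"
    )
    conn.executemany(
        "INSERT INTO sync_queue_item "
        "(queue_id, job, queue_proc_number, queue_proc_time) VALUES (?, ?, ?, ?)",
        [
            (1, "snapshot", 1, "2019-01-23 10:00:00"),  # active job
            (1, "start-vm", None, None),                # waiting
            (1, "reboot-vm", None, None),               # waiting
        ],
    )
    conn.commit()
    return conn


def pending_jobs(conn, vm_id):
    """Return the waiting (not yet picked up) jobs for a VM, oldest first."""
    rows = conn.execute(
        """
        SELECT i.job
        FROM sync_queue_item AS i
        JOIN sync_queue AS q ON q.id = i.queue_id
        WHERE q.sync_objtype = 'VmWorkJobQueue'
          AND q.sync_objid = ?
          AND i.queue_proc_number IS NULL
        ORDER BY i.id
        """,
        (vm_id,),
    ).fetchall()
    return [row[0] for row in rows]


def clear_pending_jobs(conn, vm_id):
    """Delete waiting items from sync_queue_item; the sync_queue row is kept."""
    cur = conn.execute(
        """
        DELETE FROM sync_queue_item
        WHERE queue_proc_number IS NULL
          AND queue_id IN (
              SELECT id FROM sync_queue
              WHERE sync_objtype = 'VmWorkJobQueue' AND sync_objid = ?
          )
        """,
        (vm_id,),
    )
    conn.commit()
    return cur.rowcount
```

In this toy model, `pending_jobs(build_demo_db(), 42)` lists the two waiting jobs behind the snapshot, and `clear_pending_jobs` removes only those items while leaving the queue row and the running job in place, which mirrors the point that pending work lives in sync_queue_item rather than sync_queue.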