cloudstack-dev mailing list archives

From Suresh Kumar Anaparti <sureshkumar.anapa...@gmail.com>
Subject Re: Help! Jobs stuck in pending state
Date Thu, 24 Jan 2019 18:33:04 GMT
Hi Alireza,

Table details below, to the best of my knowledge. @Dev, please correct me if
any detail is wrong.

- sync_queue and sync_queue_item tables are used to handle the per-entity
(VM, host, etc.) queues and concurrency control. In particular, all the VM
sync jobs pass through these queues.
- async_job - holds all the async jobs and any related placeholder VM async
jobs.
- vm_work_job - an extension of the placeholder VM async job in async_job; it
holds the VM id and the job stage.
- op_ha_work - holds the VM work items used to perform HA on the VMs; these are
scheduled or cancelled based on the VM state.
- op_lock - used to acquire a lock on a record in a given table (key:
<tablename> + <entityid>) for a transaction run by a thread in the
Management Server. The lock is released once the transaction completes, and
the corresponding lock record is deleted.
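
To make the queue tables above concrete, here is a minimal sketch in Python against an in-memory SQLite database. It is an illustration under stated assumptions, not CloudStack code: the schema is cut down to the columns named in this thread, the `queue_id` join column, the helper names and the sample ids are hypothetical, and a NULL `queue_proc_number` is assumed to mean "still waiting".

```python
import sqlite3

# Simplified stand-in for the two queue tables; the real schema has more columns.
SCHEMA = """
CREATE TABLE sync_queue (
    id INTEGER PRIMARY KEY,
    sync_objtype TEXT NOT NULL,
    sync_objid INTEGER NOT NULL,
    queue_size_limit INTEGER NOT NULL DEFAULT 1
);
CREATE TABLE sync_queue_item (
    id INTEGER PRIMARY KEY,
    queue_id INTEGER NOT NULL REFERENCES sync_queue(id),  -- assumed column name
    queue_proc_number INTEGER,  -- assumed NULL while the item is waiting
    queue_proc_time TEXT
);
"""


def make_demo_db():
    """Build a toy database: one queue for VM 42, one active and two waiting items."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO sync_queue VALUES (1, 'VmWorkJobQueue', 42, 1)")
    conn.executemany(
        "INSERT INTO sync_queue_item VALUES (?, 1, ?, ?)",
        [(10, 1, "2019-01-23 06:00:00"), (11, None, None), (12, None, None)],
    )
    return conn


def waiting_items(conn, vm_id):
    """Return ids of queue items for this VM that have not been picked up yet."""
    rows = conn.execute(
        """SELECT i.id FROM sync_queue_item i
           JOIN sync_queue q ON q.id = i.queue_id
           WHERE q.sync_objtype = 'VmWorkJobQueue' AND q.sync_objid = ?
             AND i.queue_proc_number IS NULL
           ORDER BY i.id""",
        (vm_id,),
    ).fetchall()
    return [row[0] for row in rows]


def clear_queue_items(conn, vm_id):
    """Delete every queue item for this VM; the sync_queue row itself is kept."""
    cur = conn.execute(
        """DELETE FROM sync_queue_item WHERE queue_id IN (
               SELECT id FROM sync_queue
               WHERE sync_objtype = 'VmWorkJobQueue' AND sync_objid = ?)""",
        (vm_id,),
    )
    conn.commit()
    return cur.rowcount
```

For example, `waiting_items(make_demo_db(), 42)` lists the two items queued behind the active one. On a real deployment the same statements would run against the MySQL `cloud` database instead, and only after stopping the management server and taking a SQL backup, as suggested further down the thread.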

Hope this helps!

-Suresh

On Thu, Jan 24, 2019 at 12:49 AM Alireza Eskandari <astro.alireza@gmail.com>
wrote:

> Dear Suresh and Andrei
> Thanks for your help.
> I have upgraded CloudStack from 4.9.3 to 4.11.2 but the problem still
> persists.
> Then I inspected the database tables and found that these three tables could
> be the root cause:
> - op_ha_work
> - op_lock
> - vm_work_job
> So I deleted all records in those tables and the problem was solved.
> The contents of those tables are posted as a comment on the bug report in
> Jira:
> https://issues.apache.org/jira/browse/CLOUDSTACK-10401
> Suresh, could you tell me more about the role of those tables in CS?
> I think CS has become more sensitive to concurrent jobs. Previous versions
> worked better.
> Regards
>
> On Wed, Jan 23, 2019 at 9:43 PM Suresh Kumar Anaparti <
> sureshkumar.anaparti@gmail.com> wrote:
>
> > Hi Alireza,
> >
> > The *sync_queue* table is the actual VM sync queue, which holds a queue id
> > for each VM (*sync_objtype*: VmWorkJobQueue, *sync_objid*: <VM-Id>), and the
> > VM jobs reside in the *sync_queue_item* table against that queue id. Only
> > one running job is allowed per VM queue (*queue_size_limit*: 1 in the
> > *sync_queue* table). The active/running job has *queue_proc_id*,
> > *queue_proc_number* and *queue_proc_time* set in the *sync_queue_item*
> > table, and the remaining jobs with that queue id wait for the active job to
> > complete. So, to delete pending jobs, the records in the *sync_queue_item*
> > table have to be cleared for the respective VMs, not those in the
> > *sync_queue* table.
> >
> > I think that, in your case, the snapshot is taking a long time, and the other
> > jobs for that VM have been pending for a long time because they are queued
> > behind the snapshot job, waiting for it to complete. What are the config
> > values set for
> > "job.cancel.threshold.minutes", "job.expire.minutes" and
> > "volume.snapshot.job.cancel.threshold"? Are the jobs cancelled after the
> > threshold time?
> >
> > Thanks,
> > Suresh
> >
> > On Wed, Jan 23, 2019 at 7:14 PM Andrei Mikhailovsky
> > <andrei@arhont.com.invalid> wrote:
> >
> > > Hi
> > >
> > > > I've had this issue a few times in 2018 and managed to get it fixed
> > > > pretty easily, although I had spent a number of hours initially trying to
> > > > figure out WTF was going on. This issue looks like one of those artefacts
> > > > that crept into one of the versions released in 2018 and hasn't been
> > > > addressed by the dev team.
> > >
> > > > The way I fixed it was similar to what has been recommended earlier.
> > > > However, the difference was that I am sure I looked at more tables than
> > > > just the two suggested. Basically, I stopped the management server,
> > > > created the sql backup, connected to the sql db and listed all tables.
> > > > I grepped for words like job/schedule/queue/sync. After that I went
> > > > through all the tables and pretty much removed all the past / active /
> > > > awaiting execution jobs. I started by looking at the vm related jobs
> > > > (for the vm that I had tried to start but wasn't able to). This worked
> > > > once, but the second time I had to remove a lot more jobs which related to
> > > > other vms. After that I started the management server and all went well
> > > > from there.
> > >
> > > > What I have also noticed is that my snapshot jobs (I use KVM and Ceph)
> > > > seem to be blocking jobs on the hypervisor hosts which are running these
> > > > snapshots. So, if I try to perform various vm related jobs on a host
> > > > server which is currently running a snapshot process, that job will not
> > > > be executed until the snapshot process is done. I've tested this
> > > > countless times and it's still the case. Again, this issue appeared in
> > > > one of the 2018 releases, as I never saw it between 2012 and 2017.
> > >
> > > Both issues are annoying as hell!
> > >
> > > Cheers
> > >
> > > ----- Original Message -----
> > > > From: "Alireza Eskandari" <astro.alireza@gmail.com>
> > > > To: "dev" <dev@cloudstack.apache.org>
> > > > Sent: Wednesday, 23 January, 2019 12:40:48
> > > > Subject: Re: Help! Jobs stuck in pending state
> > >
> > > > I'm following this issue in github:
> > > > https://github.com/apache/cloudstack/issues/3104
> > > > Please leave your comments
> > > > Thanks
> > > >
> > > > > On Wed, Jan 23, 2019 at 12:39 PM Wei ZHOU <ustcweizhou@gmail.com> wrote:
> > > >
> > > >> Hi Alireza,
> > > >>
> > > > >> Could you try again after restarting the mgmt server?
> > > >>
> > > >> -Wei
> > > >>
> > > > >> On Wed, Jan 23, 2019 at 6:22 AM Alireza Eskandari <astro.alireza@gmail.com> wrote:
> > > >>
> > > > >> > First I deleted the two jobs that existed in the vm_work_job table and
> > > > >> > the related entry in the sync_queue table, but it didn't help.
> > > > >> > Then I deleted all the entries in the sync_queue tables, and again no
> > > > >> > success.
> > > > >> > Any idea?
> > > >> >
> > > > >> > On Wed, Jan 23, 2019 at 1:50 AM Wei ZHOU <ustcweizhou@gmail.com> wrote:
> > > >> >
> > > > >> > > If you know the instance id and mysql password, it should work
> > > > >> > > after removing some records in mysql.
> > > >> > >
> > > >> > > ```
> > > >> > > set @id=XXXXX;
> > > >> > >
> > > >> > > delete from vm_work_job where vm_instance_id=@id;
> > > >> > > delete from sync_queue where sync_objid=@id;
> > > >> > > ```
> > > >> > >
> > > > >> > > On Tue, Jan 22, 2019 at 10:59 PM Alireza Eskandari <astro.alireza@gmail.com> wrote:
> > > >> > >
> > > >> > > > Hi guys
> > > >> > > > I have opened a bug in jira about my problem in CS:
> > > >> > > > https://issues.apache.org/jira/browse/CLOUDSTACK-10401
> > > > >> > > > CloudStack doesn't process jobs! My cloud is totally unusable.
> > > > >> > > > Thanks in advance for your help.
> > > >> > > >
> > > >> > >
> > > >> >
> > >
> >
>
