cloudstack-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tomasz Zieba (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CLOUDSTACK-6036) CloudStack stops the machine for no reason
Date Thu, 24 Jul 2014 16:14:38 GMT

    [ https://issues.apache.org/jira/browse/CLOUDSTACK-6036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073335#comment-14073335
] 

Tomasz Zieba commented on CLOUDSTACK-6036:
------------------------------------------

Hello,

After a little investigation, we have some facts. Error stopping the machine is always associated
with an entry in the logs of the content:

2014-07-24 17:00:34,918 DEBUG [vm.dao.VMInstanceDaoImpl] (HA-Worker-2:work-15) Unable to update
VM[User|Win-HD-tune]: DB Data={Host=null; State=Stopped; updated=55; time=Thu Jul 24 17:00:34
CEST 2014} New Data: {Host=2; State=Running; updated=55; time=Thu Jul 24 17:00:34 CEST 2014}
Stale Data: {Host=2; State=Stopping; updated=54; time=Thu Jul 24 17:00:13 CEST 2014}

and then:

2014-07-24 17:00:34,919 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-2:work-15)
Stop was unsuccessful.  Rescheduling
2014-07-24 17:00:34,919 INFO  [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-2:work-15)
Rescheduling HAWork[15-Stop-10-Stopping-Scheduled] to try again at Thu Jul 24 17:10:48 CEST
2014

After analyzing the logs of the database:

633 Query     UPDATE vm_instance SET vm_instance.update_time='2014-07-24 15:00:34', vm_instance.update_count=vm_instance.update_count+1,
vm_instance.last_host_id=2, vm_instance.state='Stopped', vm_instance.host_id=null, vm_instance.pod_id=1
WHERE vm_instance.id = 10  AND vm_instance.state = 'Stopping'  AND vm_instance.host_id = 2
 AND vm_instance.update_count = 54
633 Query     UPDATE vm_instance SET vm_instance.update_time='2014-07-24 15:00:34', vm_instance.update_count=vm_instance.update_count+1,
vm_instance.state='Running', vm_instance.host_id=2, vm_instance.pod_id=1 WHERE vm_instance.id
= 10  AND vm_instance.state = 'Stopping'  AND vm_instance.host_id = 2  AND vm_instance.update_count
= 54

As can be seen ACS tries to update the entry twice but twice looking vm_instance.update_count
field with a value of 54 
So the first update causes the counter vm_instance.update_count increased to 55 so once the
second SQL Update command executed incorrectly.

My guess is that the situation is related to a race condition in the HA-Worker threads, which
by default is 5 

In the test environment, we try to change ha.workers value to 1 and we will see if the situation
repeats itself.


>  CloudStack stops the machine for no reason
> -------------------------------------------
>
>                 Key: CLOUDSTACK-6036
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6036
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.) 
>          Components: Management Server
>    Affects Versions: 4.2.1
>         Environment: ACS 4.2.1 after upgrade from 3.0.2
> ACS 4.2.1 clean install
> XCP 1.1
>            Reporter: Tomasz Zieba
>            Assignee: Koushik Das
>            Priority: Critical
>              Labels: 4.5ReviewNeeded
>             Fix For: 4.4.0
>
>         Attachments: management-server.log.2014-02-19.gz, management-server.log.2014-02-20.gz,
management-server.log.2014-02-24.txt
>
>
> After the upgrade from version 3.0.2 to 4.2.1 appeared very strange error associated
with self-stopping machine after changing the offering. 
> Steps to reproduce: 
> 1. Running instance of the machine 
> 2. Stop with the operating system 
> 3. Change offering of machine
> 4. Start the machine 
> 5. Waiting for 10 minutes 
> 6. CloudStack stops the machine for no reason
> 7. Restart the machine 
> In the logs you can find information:
> 2014-02-05 06:27:00,974 DEBUG [xen.resource.CitrixResourceBase] (DirectAgent-316:null)
11. The VM i-41-824-VM is in Running state
> 2014-02-05 06:27:00,974 DEBUG [agent.transport.Request] (DirectAgent-316:null) Seq 50-1756626952:
Processing:  { Ans: , MgmtId: 160544475005497, via: 50, Ver: v1, Flags: 10, [{"com.cloud.agent.api.ClusterSyncAnswer":{"_clusterId":2,"_newStates":{"i-41-824-VM":{"t":"f32a4fee-b64e-4868-9c52-2a27e32d4c0e","u":"Running","v":"viridian:true;acpi:true;apic:true;pae:true;nx:false;timeoffset:0;cores-per-socket:4"}},"_isExecuted":false,"result":true,"wait":0}}]
}
> 2014-02-05 06:27:00,981 DEBUG [cloud.vm.VirtualMachineManagerImpl] (DirectAgent-316:null)
VM i-41-824-VM: cs state = Running and realState = Running
> 2014-02-05 06:27:00,981 DEBUG [cloud.vm.VirtualMachineManagerImpl] (DirectAgent-316:null)
VM i-41-824-VM: cs state = Running and realState = Running
> 2014-02-05 06:36:01,240 DEBUG [agent.transport.Request] (HA-Worker-1:work-1511) Seq 51-1374970375:
Sending  { Cmd , MgmtId: 160544475005497, via: 51, Ver: v1, Flags: 100111, [{"com.cloud.agent.api.StopCommand":{"isProxy":false,"executeInSequence":true,"vmName":"i-41-824-VM","wait":0}}]
}
> 2014-02-05 06:36:01,240 DEBUG [agent.transport.Request] (HA-Worker-1:work-1511) Seq 51-1374970375:
Executing:  { Cmd , MgmtId: 160544475005497, via: 51, Ver: v1, Flags: 100111, [{"com.cloud.agent.api.StopCommand":{"isProxy":false,"executeInSequence":true,"vmName":"i-41-824-VM","wait":0}}]
}
> 2014-02-05 06:36:01,383 DEBUG [xen.resource.CitrixResourceBase] (DirectAgent-150:null)
9. The VM i-41-824-VM is in Stopping state
> 2014-02-05 06:36:27,625 DEBUG [xen.resource.CitrixResourceBase] (DirectAgent-150:null)
10. The VM i-41-824-VM is in Stopped state
> You will notice that the stop of the machine corresponds to the component HA-Worker.

> Such operation after the upgrade is complicating work of users.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Mime
View raw message