cloudstack-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CLOUDSTACK-10310) KVM hosts reboot if there is a short transient storage error
Date Mon, 08 Oct 2018 14:58:00 GMT

    [ https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16641952#comment-16641952 ]

ASF GitHub Bot commented on CLOUDSTACK-10310:
---------------------------------------------

csquire edited a comment on issue #2722: CLOUDSTACK-10310 Fix KVM reboot on storage issue
URL: https://github.com/apache/cloudstack/pull/2722#issuecomment-427859929
 
 
   This PR doesn't seem to completely fix the problem (or maybe this is a completely new problem).
We installed the RC release with this PR on a test system and were able to get the KVM host
marked as `Down` by using iptables to drop outgoing requests to NFS. My investigation
shows that the line [`storage = conn.storagePoolLookupByUUIDString(uuid);`](https://github.com/apache/cloudstack/blob/4.11/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/resource/KVMHAMonitor.java#L95)
blocks indefinitely. As a result, `kvmheartbeat.sh` is never executed, a host investigation is started,
the host with blocked NFS is marked as `Down`, and finally all VMs on that host are rescheduled,
which results in duplicate VMs.
   
   I pulled a thread dump and found that the KVMHAMonitor thread hangs here until NFS is unblocked.
I haven't dug any deeper yet, though.
   
   ```
   "Thread-20" - Thread t@135
      java.lang.Thread.State: RUNNABLE
           at com.sun.jna.Native.invokePointer(Native Method)
           at com.sun.jna.Function.invokePointer(Function.java:470)
           at com.sun.jna.Function.invoke(Function.java:404)
           at com.sun.jna.Function.invoke(Function.java:315)
           at com.sun.jna.Library$Handler.invoke(Library.java:212)
           at com.sun.proxy.$Proxy3.virStoragePoolLookupByUUIDString(Unknown Source)
           at org.libvirt.Connect.storagePoolLookupByUUIDString(Unknown Source)
           at com.cloud.hypervisor.kvm.resource.KVMHAMonitor$Monitor.runInContext(KVMHAMonitor.java:95)
           - locked <1afb3370> (a java.util.concurrent.ConcurrentHashMap)
           at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
           at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
           at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
           at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
           at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
           at java.lang.Thread.run(Thread.java:748)
   
      Locked ownable synchronizers:
           - None
   ```
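
The hang described above comes from a call that never returns, so the heartbeat step behind it is never reached. One way to picture a mitigation is to bound how long the caller waits. The sketch below is an illustration only, not what this PR or CloudStack does: `BoundedCall` and `callWithTimeout` are hypothetical names, and the task passed in stands in for the blocking lookup.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of CloudStack: bounds how long a caller waits
// for a call that may block, such as a storage pool lookup on hung NFS.
final class BoundedCall {
    private BoundedCall() {}

    // Runs the task on a daemon worker thread and waits at most timeoutMillis.
    // Returns Optional.empty() on timeout or failure so the caller can move on
    // to its next step instead of hanging.
    static <T> Optional<T> callWithTimeout(Callable<T> task, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-call");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return Optional.ofNullable(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            // cancel(true) only interrupts the worker; a thread blocked in
            // native code may stay stuck until the storage comes back.
            future.cancel(true);
            return Optional.empty();
        } catch (ExecutionException e) {
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A caller would treat an empty result as "storage state unknown" and decide what to do next. Note that this only frees the waiting thread: the worker may remain blocked, and any lock it took (the dump shows one on a `ConcurrentHashMap`) would still be held, so this is a sketch of the idea rather than a complete fix.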

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> KVM hosts reboot if there is a short transient storage error
> ------------------------------------------------------------
>
>                 Key: CLOUDSTACK-10310
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
>             Project: CloudStack
>          Issue Type: Improvement
>      Security Level: Public(Anyone can view this level - this is the default.) 
>          Components: KVM
>    Affects Versions: 4.9.0, 4.10.0.0
>            Reporter: Sean Lair
>            Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, taking
down all VMs running on it.  The code does try 5 times before the reboot, but there
is no delay between the retries, so they are 5 nearly simultaneous retries, which doesn't help. 
Standard SAN storage HA operations or a quick network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just commenting
out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be to have it sleep between tries so
they aren't 5 almost simultaneous tries.  Plus, instead of rebooting, the cloudstack-agent
could just be stopped on the host.  This will cause alerts to be issued and, if the
host is disconnected long enough, depending on the HA code in use, VM HA could handle the
host failure.
> The built-in reboot of the host seemed drastic.
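
The "sleep between tries" idea from the description can be sketched as follows. This is a minimal illustration under assumptions, not the code in CloudStack or in the PR: `HeartbeatRetry` and `retryWithDelay` are hypothetical names, and the `attempt` supplier stands in for whatever writes the heartbeat file.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper, not the CloudStack implementation.
final class HeartbeatRetry {
    private HeartbeatRetry() {}

    // Calls attempt up to maxTries times, sleeping delayMillis after each
    // failed try except the last. Returns true as soon as one try succeeds,
    // false if all tries fail or the thread is interrupted while waiting.
    static boolean retryWithDelay(BooleanSupplier attempt, int maxTries, long delayMillis) {
        for (int i = 1; i <= maxTries; i++) {
            if (attempt.getAsBoolean()) {
                return true;
            }
            if (i < maxTries) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }
}
```

With 5 tries and a delay of several seconds, a short storage blip has time to clear before the caller gives up. What happens after a `false` result (reboot, or stopping the agent as proposed above) is left to the caller.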



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
