cloudstack-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <>
Subject [jira] [Commented] (CLOUDSTACK-9350) Local storage hosts get HA tasks, cause issues
Date Mon, 09 May 2016 04:09:12 GMT


ASF GitHub Bot commented on CLOUDSTACK-9350:

Github user swill commented on the pull request:
    ### CI RESULTS
    Tests Run: 88
      Skipped: 2
       Failed: 1
       Errors: 1
     Duration: 11h 25m 09s
    **Summary of the problem(s):**
    ERROR: Test to verify access to loadbalancer haproxy admin stats page
    Traceback (most recent call last):
      File "/data/git/cs1/cloudstack/test/integration/smoke/", line 854, in tearDown
        raise Exception("Cleanup failed with %s" % e)
    Exception: Cleanup failed with Job failed: {jobprocstatus : 0, created : u'2016-05-07T12:50:26+0200', jobresult : {errorcode : 530, errortext : u'Failed to delete network'}, cmd : u'', userid : u'b90ec272-1410-11e6-9152-5254001daa61', jobstatus : 2, jobid : u'04de60c8-0aa7-4488-a076-a7475b147b47', jobresultcode : 530, jobresulttype : u'object', jobinstancetype : u'Network', accountid :
    Additional details in: /tmp/MarvinLogs/test_network_9UCT1L/results.txt
    FAIL: Test create, assign, remove of an Internal LB with roundrobin http traffic to 3 vm's in a Single VPC
    Traceback (most recent call last):
      File "/data/git/cs1/cloudstack/test/integration/smoke/", line 599, in test_01_internallb_roundrobin_1VPC_3VM_HTTP_port80
      File "/data/git/cs1/cloudstack/test/integration/smoke/", line 668, in execute_internallb_roundrobin_tests
      File "/data/git/cs1/cloudstack/test/integration/smoke/", line 519, in setup_http_daemon
        "Failed to ssh into vm: %s due to %s" % (vm, e))
    AssertionError: Failed to ssh into vm: <marvin.lib.base.VirtualMachine instance at 0x3624170> due to not all arguments converted during string formatting
    Additional details in: /tmp/MarvinLogs/test_network_9UCT1L/results.txt
    **Associated Uploads**
    * [dc_entries.obj](
    * [failed_plus_exceptions.txt](
    * [runinfo.txt](
    * [failed_plus_exceptions.txt](
    * [results.txt](
    * [runinfo.txt](
    * [failed_plus_exceptions.txt](
    * [results.txt](
    * [runinfo.txt](
    * [failed_plus_exceptions.txt](
    * [results.txt](
    * [runinfo.txt](
    Uploads will be available until `2016-07-09 02:00:00 +0200 CEST`
    *Comment created by [`upr comment`](*

> Local storage hosts get HA tasks, cause issues	
> -----------------------------------------------
>                 Key: CLOUDSTACK-9350
>                 URL:
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.) 
>    Affects Versions: 4.5.1
>            Reporter: Abhinandan Prateek
>            Assignee: Abhinandan Prateek
> When a host hits its ping time out, for whatever reason, the investigators are triggered.
The KVMInvestigator sends a CheckOnHostCommand to the target host, and then to all the remaining
neighbor hosts in the cluster. The CheckOnHostCommand (and also FenceCommand, the code is
nearly identical) is processed by the KVM agent and simply scans through all NFS primary storage
looking for the host's heartbeat in the KVMHA directory. If no heartbeat file is found, it
fails the check. In the case of clusters that are local-only, these hosts will always fail
the check, whether it be the target host or a neighbor checking on the target. This triggers
a host 'down' event, which triggers HA tasks. The HA tasks will attempt to stop any VMs on
the host, and then if the VM's offering is HA-enabled it will try to restart the VM.
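The heartbeat scan described above can be sketched roughly as follows. This is a minimal illustration, not the actual CloudStack agent code; the class, method, and file-name conventions here are hypothetical stand-ins.

```java
import java.io.File;
import java.util.Collections;
import java.util.List;

// Illustrative sketch of the NFS heartbeat scan the KVM agent performs
// when it processes a CheckOnHostCommand (names are hypothetical).
public class HeartbeatCheckSketch {

    // Returns true if a heartbeat file for hostIp exists in the KVMHA
    // directory of any NFS primary storage pool mount point.
    static boolean hostHasHeartbeat(List<String> nfsPoolMountPoints, String hostIp) {
        for (String mountPoint : nfsPoolMountPoints) {
            File hb = new File(mountPoint + "/KVMHA/hb-" + hostIp);
            if (hb.exists()) {
                return true;
            }
        }
        // No NFS pools, or no heartbeat file found: the check fails.
        // On a local-storage-only cluster the pool list is always empty,
        // so this branch is always taken -- the failure mode described
        // in this issue.
        return false;
    }

    public static void main(String[] args) {
        // A cluster with zero NFS pools can never pass the check.
        System.out.println(hostHasHeartbeat(Collections.emptyList(), "10.0.0.5"));
    }
}
```

The point of the sketch is that the check's result depends entirely on NFS pools being present, which is why local-only clusters always fail it.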
> Our recent issue was that a management server took extraordinarily long to rotate its
logs and was slow to process some host pings. The CheckOnHostCommand was sent to a suspect
host, which failed because it had no primary NFS. The neighbor checks also failed to check
the suspect host's heartbeat for the same reason. Then the host was marked as down and all
VMs were stopped. Multiply this by a few dozen hosts.
> The immediate fix, provided in the example, is a patch to KVMInvestigator which will
only attempt investigation if the host's cluster has NFS storage, which is a requirement for
the host to run the check, as described above. If there is none, the host state is determined
to be disconnected rather than down. This means that the host will still end up in alert state
and need manual investigation, but there will be no attempt to stop or HA the VMs.
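The shape of that server-side fix can be sketched as a simple guard. The enum values and method below are hypothetical simplifications of the CloudStack types, assuming the investigator can query whether the host's cluster has any NFS primary storage.

```java
// Illustrative sketch of the KVMInvestigator fix: skip the heartbeat-based
// investigation when the host's cluster has no NFS primary storage
// (enum and method names are hypothetical stand-ins).
public class InvestigatorSketch {

    enum HostStatus { UP, DOWN, DISCONNECTED }

    static HostStatus investigate(boolean clusterHasNfsStorage, boolean heartbeatFound) {
        if (!clusterHasNfsStorage) {
            // Without NFS the heartbeat check is meaningless, so report
            // Disconnected: the host lands in alert state for manual
            // review, but no HA tasks (VM stop/restart) are triggered.
            return HostStatus.DISCONNECTED;
        }
        return heartbeatFound ? HostStatus.UP : HostStatus.DOWN;
    }

    public static void main(String[] args) {
        System.out.println(investigate(false, false)); // DISCONNECTED
        System.out.println(investigate(true, false));  // DOWN
    }
}
```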
> Additionally, the patch catches scenarios where a cluster might have both NFS and local
storage and a host ends up in 'down' state. In this case, when the HA tasks are being created,
if a VM is using local storage then the HA task generation is skipped. This VM can't be started
anywhere else.
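The second part of the patch amounts to a per-VM filter during HA task generation. Again a hypothetical sketch, not the actual patch code:

```java
// Illustrative sketch of skipping HA task generation for VMs on local
// storage when a host in a mixed NFS/local cluster is marked down
// (names are hypothetical).
public class HaTaskFilterSketch {

    static boolean shouldScheduleHaTask(boolean vmUsesLocalStorage, boolean offeringHaEnabled) {
        if (vmUsesLocalStorage) {
            // The VM's volumes exist only on the failed host's local disk,
            // so it cannot be started anywhere else; skip the HA task.
            return false;
        }
        // Otherwise honor the service offering's HA flag as before.
        return offeringHaEnabled;
    }

    public static void main(String[] args) {
        System.out.println(shouldScheduleHaTask(true, true));   // false
        System.out.println(shouldScheduleHaTask(false, true));  // true
    }
}
```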
> We could also make the agent side more robust: in KVMHAChecker we may not want to return 'false' when zero pools were passed in to check for an HA heartbeat. Then again, maybe we do. We initially decided to patch just the server side, because it is easier to deploy.
> In the long run, I'd hope that the current HA work will supersede the current KVMInvestigator and take into account the cluster's ability to pass any defined checks before investigating.

This message was sent by Atlassian JIRA
